Inferring Phonemic Classes from CNN Activation Maps Using Clustering Techniques

Thomas Pellegrini, Sandrine Mouysset


Today’s state of the art in speech recognition involves deep neural networks (DNNs). In recent years, research effort has been invested in characterizing the feature representations learned by DNNs. In this paper, we focus on convolutional neural networks (CNNs) trained for phoneme recognition in French. We report clustering experiments performed on activation maps extracted from the different layers of a CNN composed of two convolution and sub-sampling layers followed by three dense layers. Our goal was to gain insight into phone separability and the phonemic categories inferred by the network, and into how they vary across the successive layers. Two directions were explored, with both linear and non-linear clustering techniques. First, we fixed the number of classes at 33, the number of context-independent phone models for French, in order to assess the phoneme separability power of the different layers. As expected, we observed that this power increases with layer depth: from 34% to 74% in F-measure from the first convolution layer to the last dense layer, when using spectral clustering. Second, optimal numbers of classes were automatically inferred through inter- and intra-cluster measure criteria. We analyze these classes in terms of standard French phonological features.
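
As an illustration of how a clustering of activation vectors might be scored against reference phone labels, the following minimal sketch computes a pair-counting F-measure. The abstract does not state which F-measure variant was used, so the pair-counting definition, the function name `pairwise_f_measure`, and the toy labels are all assumptions made for illustration only.

```python
from collections import Counter
from math import comb


def pairwise_f_measure(true_labels, cluster_ids):
    """Pair-counting F-measure between reference labels and cluster ids.

    Assumption: this is one common clustering F-measure, not necessarily
    the variant used in the paper.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("inputs must have the same length")

    # Pairs that share both a reference label and a cluster (true positives).
    joint = Counter(zip(true_labels, cluster_ids))
    tp = sum(comb(n, 2) for n in joint.values())

    # All pairs placed in the same cluster (true + false positives).
    same_cluster = sum(comb(n, 2) for n in Counter(cluster_ids).values())
    # All pairs sharing a reference label (true positives + false negatives).
    same_label = sum(comb(n, 2) for n in Counter(true_labels).values())

    if tp == 0:
        return 0.0
    precision = tp / same_cluster
    recall = tp / same_label
    return 2 * precision * recall / (precision + recall)


# Toy example: six frames labelled with three hypothetical phones.
phones = ["a", "a", "i", "i", "u", "u"]
perfect = [0, 0, 1, 1, 2, 2]   # one cluster per phone
merged = [0, 0, 0, 1, 1, 1]    # two clusters that split the phone "i"

score_perfect = pairwise_f_measure(phones, perfect)
score_merged = pairwise_f_measure(phones, merged)
```

In such a setup, the cluster ids would come from running a clustering algorithm (for example spectral clustering with 33 clusters) on the activations of a given layer, and the score would be compared across layers.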


DOI: 10.21437/Interspeech.2016-1299

Cite as

Pellegrini, T., Mouysset, S. (2016) Inferring Phonemic Classes from CNN Activation Maps Using Clustering Techniques. Proc. Interspeech 2016, 1290-1294.

Bibtex
@inproceedings{Pellegrini+2016,
author={Thomas Pellegrini and Sandrine Mouysset},
title={Inferring Phonemic Classes from CNN Activation Maps Using Clustering Techniques},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-1299},
url={http://dx.doi.org/10.21437/Interspeech.2016-1299},
pages={1290--1294}
}