Unsupervised acoustic modeling is an important and challenging problem in spoken language technology development for low-resource languages. It aims at automatically learning a set of speech units from untranscribed data. These learned units are expected to be related to the fundamental linguistic units that constitute the language concerned. Formulated as a clustering problem, unsupervised acoustic modeling methods are often evaluated in terms of average purity or similar performance measures, which provide no detailed insight into the fitness of individual learned units or the relations between them. This paper presents an investigation into the linguistic relevance of learned speech units based on Kullback-Leibler (KL) divergence. A symmetric KL divergence metric is used to measure the distance between each learned unit and each ground-truth phoneme of the target language. Experimental analysis on a multilingual database shows that KL divergence is consistent with purity in evaluating clustering results. The deviation between a learned unit and its closest ground-truth phoneme is comparable to the inherent variability of that phoneme. The learned speech units cover the linguistically defined phonemes well; however, certain phonemes cannot be covered, for example, the retroflex final /er/ in Mandarin.
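The abstract does not spell out the exact symmetrization used, so the following is a minimal sketch only: it assumes each learned unit and each ground-truth phoneme is summarized as a discrete probability distribution (e.g., a normalized occupancy histogram over a shared set of acoustic states) and that the symmetric metric is the common averaged form (KL(p||q) + KL(q||p)) / 2. The function names and the eps smoothing constant are illustrative, not taken from the paper.

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for discrete distributions; eps avoids log(0).
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def symmetric_kl(p, q):
    # Averaged symmetric KL: 0.5 * (KL(p||q) + KL(q||p)).
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

# Example: distance between a (hypothetical) learned unit's distribution
# and a (hypothetical) ground-truth phoneme's distribution.
unit = [0.70, 0.20, 0.10]
phoneme = [0.60, 0.30, 0.10]
print(symmetric_kl(unit, phoneme))

Under this setup, the learned unit would be assigned to whichever phoneme minimizes symmetric_kl, and the minimum value quantifies how far the unit deviates from its closest linguistically defined counterpart.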
Cite as: Feng, S., Lee, T. (2017) On the Linguistic Relevance of Speech Units Learned by Unsupervised Acoustic Modeling. Proc. Interspeech 2017, 2068-2072, doi: 10.21437/Interspeech.2017-300
@inproceedings{feng17_interspeech,
  author={Siyuan Feng and Tan Lee},
  title={{On the Linguistic Relevance of Speech Units Learned by Unsupervised Acoustic Modeling}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2068--2072},
  doi={10.21437/Interspeech.2017-300}
}