We investigate methods for unsupervised learning of the sub-word acoustic units of a language directly from speech. We demonstrated earlier that the states of a hidden Markov model, "grown" using a novel modification of the maximum likelihood successive state splitting algorithm, correspond very well with the phones of the language. The correspondence between the Viterbi state sequence for unseen speech from the training speaker and the phone transcription of that speech is over 85%, and it generalizes to a large extent (~61%) to speech from a different speaker. Furthermore, we are able to bridge more than half the gap between the speaker-dependent and cross-speaker correspondence of the automatically learned units to phones (~73% accuracy) via unsupervised adaptation using maximum likelihood linear regression (MLLR).
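One natural way to score how well automatically learned states correspond to phones, and an assumption about how such a correspondence percentage could be computed rather than the paper's exact procedure, is to align each frame's Viterbi state with its reference phone label, map every state to the phone it most often co-occurs with, and report frame-level accuracy under that mapping. A minimal sketch:

```python
from collections import Counter, defaultdict

def state_phone_correspondence(state_seq, phone_seq):
    """Frame-level correspondence between a Viterbi state sequence and a
    phone transcription: map each state to its majority phone, then score
    the fraction of frames whose state maps to the true phone.

    Hypothetical illustration; the paper's exact scoring may differ."""
    assert len(state_seq) == len(phone_seq), "sequences must be frame-aligned"
    # Count co-occurrences of each learned state with each phone label.
    counts = defaultdict(Counter)
    for state, phone in zip(state_seq, phone_seq):
        counts[state][phone] += 1
    # Majority-vote mapping from learned state to phone.
    mapping = {s: c.most_common(1)[0][0] for s, c in counts.items()}
    correct = sum(mapping[s] == p for s, p in zip(state_seq, phone_seq))
    return correct / len(state_seq)

# Toy example: state 0 mostly co-occurs with phone 'a', state 1 with 'b'.
score = state_phone_correspondence([0, 0, 0, 1, 1],
                                   ['a', 'a', 'b', 'b', 'b'])
```

The toy sequences above give a score of 0.8: state 0 maps to 'a' and state 1 to 'b', so four of the five frames agree with the mapping.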
Bibliographic reference. Varadarajan, Balakrishnan / Khudanpur, Sanjeev (2008): "Automatically learning speaker-independent acoustic subword units", in INTERSPEECH-2008, pp. 1333–1336.