This paper proposes a method for the unsupervised learning of lexicons from pairs consisting of a spoken utterance and an object that serves as its meaning, without any a priori linguistic knowledge other than a phoneme acoustic model. To obtain a lexicon, a statistical model of the joint probability of a spoken utterance and an object is learned based on the minimum description length principle. This model consists of a list of word phoneme sequences and three statistical models: the phoneme acoustic model, a word-bigram model, and a word meaning model. Experimental results show that the method can acquire acoustically, grammatically, and semantically appropriate words with about 85% phoneme accuracy.
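The model selection step can be illustrated with a minimal sketch, assuming a generic two-part description length (negative log-likelihood plus a parameter-count penalty). The abstract does not state the exact criterion used in the paper, so the function names, the penalty form, and the toy candidate values below are all hypothetical and serve only to show how a minimum description length rule trades data fit against lexicon size.

```python
import math


def description_length(log_likelihood, num_params, num_samples):
    """Generic two-part description length in nats (an assumed form,
    not necessarily the paper's criterion): cost of the data given the
    model plus cost of encoding the model parameters."""
    return -log_likelihood + 0.5 * num_params * math.log(num_samples)


def select_lexicon(candidates, num_samples):
    """Return the candidate model with the smallest description length."""
    return min(
        candidates,
        key=lambda c: description_length(
            c["log_likelihood"], c["num_params"], num_samples
        ),
    )


# Hypothetical candidates: a larger lexicon fits the utterance-object
# pairs better but needs more parameters to describe.
candidates = [
    {"name": "small", "log_likelihood": -1200.0, "num_params": 20},
    {"name": "medium", "log_likelihood": -1000.0, "num_params": 60},
    {"name": "large", "log_likelihood": -990.0, "num_params": 200},
]

best = select_lexicon(candidates, num_samples=500)
print(best["name"])
```

With these toy numbers the medium-sized candidate wins: the large lexicon improves the likelihood only slightly, and that gain does not offset the cost of its extra parameters.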
Bibliographic reference. Taguchi, Ryo / Iwahashi, Naoto / Nose, Takashi / Funakoshi, Kotaro / Nakano, Mikio (2009): "Learning lexicons from spoken utterances based on statistical model selection", In INTERSPEECH-2009, 2731-2734.