15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Speech Recognition Without a Lexicon — Bridging the Gap Between Graphemic and Phonetic Systems

David Harwath, James R. Glass


Modern speech recognizers rely on three core components: an acoustic model, a language model, and a pronunciation lexicon. To extend speech recognition to low-resource languages and domains, techniques that reduce the expert knowledge required to craft these three components have been growing in popularity. In this paper, we present a method for automatically learning a weighted pronunciation lexicon in a data-driven fashion without assuming the existence of any phonetic lexicon. Given an initial grapheme acoustic model, our method uses a novel technique for semi-constrained acoustic unit decoding, which helps train a letter-to-sound (L2S) model. The L2S model is then used in conjunction with a Pronunciation Mixture Model (PMM) to infer a pronunciation lexicon. We evaluate our method on English as well as Lao and Haitian, two low-resource languages featured in the IARPA Babel program.
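
To illustrate what a "weighted pronunciation lexicon" means in practice, the sketch below estimates mixture weights over candidate pronunciations of a single word with expectation-maximization. It is a simplified, isolated-word illustration of the general mixture-model idea, not the authors' implementation. The function name `pmm_weights`, the toy pronunciations, and the likelihood values are all hypothetical; in a real system the candidates would come from an L2S model and the likelihoods from an acoustic model.

```python
from typing import Dict, List


def pmm_weights(likelihoods: List[Dict[str, float]],
                iterations: int = 20) -> Dict[str, float]:
    """Estimate weights over candidate pronunciations of one word with EM.

    likelihoods[i][p] is an assumed acoustic likelihood of spoken example i
    under candidate pronunciation p. All values must be positive.
    """
    candidates = sorted(likelihoods[0])
    # Start from a uniform distribution over the candidates.
    weights = {p: 1.0 / len(candidates) for p in candidates}
    for _ in range(iterations):
        counts = {p: 0.0 for p in candidates}
        for example in likelihoods:
            # E-step: posterior of each pronunciation given this example.
            total = sum(weights[p] * example[p] for p in candidates)
            for p in candidates:
                counts[p] += weights[p] * example[p] / total
        # M-step: the new weight is the average posterior across examples.
        weights = {p: counts[p] / len(likelihoods) for p in candidates}
    return weights


# Toy data (made up): three spoken examples, two candidate pronunciations.
toy = [
    {"t ah m ey t ow": 0.8, "t ah m aa t ow": 0.2},
    {"t ah m ey t ow": 0.7, "t ah m aa t ow": 0.3},
    {"t ah m ey t ow": 0.1, "t ah m aa t ow": 0.9},
]
lexicon_entry = pmm_weights(toy)
```

The resulting dictionary assigns each candidate a probability, so a pronunciation supported by more of the spoken examples receives a larger weight without any candidate being discarded outright.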

Bibliographic reference. Harwath, David / Glass, James R. (2014): "Speech recognition without a lexicon — bridging the gap between graphemic and phonetic systems", in INTERSPEECH-2014, 2655-2659.