Interspeech'2005 - Eurospeech

Lisbon, Portugal
September 4-8, 2005

Stochastic Pronunciation Modeling by Ergodic-HMM of Acoustic Sub-Word Units

V. Ramasubramanian, P. Srinivas, T. V. Sreenivas

Indian Institute of Science, India

We propose a stochastic pronunciation model using an ergodic hidden Markov model (EHMM) of automatically derived acoustic sub-word units (SWUs). The proposed EHMM discovers the pronunciation structure inherent in the acoustic training data of a word without any a priori phonetic transcriptions. The EHMM is an HMM of HMMs: its states are SWU HMMs, and its state transitions compose the various possible lexicons. The EHMM parameters are estimated by an iterative segmental K-means procedure that jointly optimizes the sub-word units (states) and the pronunciation structure parameters (state transitions). The EHMM-based pronunciation model is evaluated on an English isolated word recognition task with 70 speakers drawn from 8 highly different first languages. Results show that the EHMM learns the lexicon distribution over the population of speakers for each word, thereby effectively modeling inter-speaker pronunciation variability. The EHMM offers an improvement of 8% (absolute) in word recognition accuracy over the performance of a single most likely lexicon.
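
As a rough illustration of the estimation loop described above, the following is a minimal sketch and not the authors' implementation. It assumes each sub-word unit is collapsed to a single state whose per-frame log-likelihoods are already available, whereas in the paper each state is itself an HMM that is re-estimated jointly. The function names `viterbi` and `reestimate_transitions`, and the `floor` smoothing constant, are hypothetical.

```python
import math

def viterbi(log_emit, log_trans, log_init):
    """Best state path through a fully connected (ergodic) model.

    log_emit[t][s]  : log-likelihood of frame t under unit s (assumed given)
    log_trans[r][s] : log-probability of moving from unit r to unit s
    log_init[s]     : log-probability of starting in unit s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for t in range(1, len(log_emit)):
        prev, score, ptr = score, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][s] + log_emit[t][s])
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def reestimate_transitions(paths, n_states, floor=1e-3):
    """Segmental K-means style update: count transitions along the
    Viterbi paths and renormalize each row (floor avoids log(0))."""
    counts = [[floor] * n_states for _ in range(n_states)]
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1.0
    return [[math.log(c / sum(row)) for c in row] for row in counts]

# Toy usage: two units, four frames, uniform initial transitions.
half = math.log(0.5)
log_emit = [[0.0, -5.0], [0.0, -5.0], [-5.0, 0.0], [-5.0, 0.0]]
path = viterbi(log_emit, [[half, half], [half, half]], [half, half])
new_trans = reestimate_transitions([path], n_states=2)
```

In a full segmental K-means loop, the decoding and re-estimation steps would alternate until the segmentation stops changing, with the unit models themselves also updated from the frames assigned to them; that unit update is omitted here.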

Bibliographic reference.  Ramasubramanian, V. / Srinivas, P. / Sreenivas, T. V. (2005): "Stochastic pronunciation modeling by ergodic-HMM of acoustic sub-word units", In INTERSPEECH-2005, 1361-1364.