8th International Conference on Spoken Language Processing

Jeju Island, Korea
October 4-8, 2004

Context Dependent "Long Units" for Speech Recognition

Denis Jouvet, Ronaldo Messina

France Telecom R&D, France

It is expected that longer-than-phoneme units, such as syllables or multi-phone units, can deal with sources of performance degradation such as pronunciation variation and coarticulation better than phoneme-sized units like triphones. However, the number of possible contextual realizations of these "long units" (LU) is very high, leading to an explosion in the number of parameters to be estimated. As training data are limited, the usual solution is to share parameters between different units in order to improve parameter estimation. Another problem is how to provide a model for a unit (syllable or multi-phone unit) that was not present in the training data (an unseen unit). In this paper we evaluate and compare syllable units and automatically derived multi-phone units. We introduce a method called "contextual factorization" to share parameters between different models, and we propose a figure of merit for deciding which decomposition of an unseen syllable is the most appropriate. Performance is improved compared to a triphone-based system.
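The idea of decomposing an unseen syllable into seen sub-units can be sketched with a simple dynamic program; note that the unit inventory, the training counts, and the scoring function below are illustrative assumptions only, not the paper's actual figure of merit.

```python
import math

# Toy inventory of "seen" multi-phone units with training counts (assumed values).
SEEN_UNITS = {"s": 10, "t": 10, "r": 10, "a": 50, "st": 45, "tr": 60, "str": 5}

def best_decomposition(phones, seen=SEEN_UNITS):
    """Return the highest-scoring segmentation of `phones` into seen units.

    Figure of merit (a hypothetical stand-in): prefer fewer, better-trained
    pieces by summing log-counts and charging a fixed penalty per piece.
    """
    n = len(phones)
    # best[i] = (score, segmentation) for the prefix phones[:i]
    best = [(-float("inf"), [])] * (n + 1)
    best[0] = (0.0, [])
    for i in range(1, n + 1):
        for j in range(i):
            piece = phones[j:i]
            if piece in seen and best[j][0] > -float("inf"):
                score = best[j][0] + math.log(seen[piece]) - 1.0  # -1 per piece
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [piece])
    return best[n][1]

# With these toy counts, the well-trained unit "tr" is preferred over
# splitting into "t" + "r":
print(best_decomposition("stra"))  # → ['s', 'tr', 'a']
```

Any monotone scoring of a segmentation decomposes over its pieces here, so the single left-to-right pass suffices; a different figure of merit would only change the `score` line.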


Bibliographic reference. Jouvet, Denis / Messina, Ronaldo (2004): "Context dependent "long units" for speech recognition", in Proc. INTERSPEECH 2004, pp. 645-648.