INTERSPEECH 2004 - ICSLP
It is expected that longer-than-phoneme units, such as syllables or multi-phone units, can handle sources of performance degradation, such as pronunciation variation and coarticulation, better than phoneme-sized units like triphones. However, the number of possible contextual realizations of these "long units" (LU) is very high, leading to an explosion in the number of parameters to be estimated. Since training data are limited, the usual solution is to share parameters between different units to improve parameter estimation. Another problem is how to provide a model for a unit (syllable or multi-phone) that was not observed during training (an unseen unit). In this paper we evaluate and compare syllables and automatically derived multi-phone units. We introduce a method called "contextual factorization" to share parameters between different models, and we propose a figure of merit to decide which decomposition of an unseen syllable is the most appropriate. Performance is improved compared to a triphone-based system.
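The idea of decomposing an unseen syllable into smaller, seen units under a figure of merit can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual criterion: the unit inventory, the training counts, and the count-threshold scoring rule are all invented here purely to show the shape of the search.

```python
# Illustrative sketch: picking a decomposition of an unseen syllable.
# The figure of merit below (favor the fewest units, reject any unit
# with too little training data) is a hypothetical stand-in for the
# criterion proposed in the paper; counts and units are made up.

def decompositions(phonemes, seen_units):
    """Enumerate all ways to split a phoneme sequence into seen units."""
    if not phonemes:
        yield []
        return
    for i in range(1, len(phonemes) + 1):
        prefix = tuple(phonemes[:i])
        if prefix in seen_units:
            for rest in decompositions(phonemes[i:], seen_units):
                yield [prefix] + rest

def figure_of_merit(decomp, counts, min_count=50):
    """Toy score: prefer few long units, but disqualify any
    decomposition that uses a poorly trained unit."""
    if any(counts.get(u, 0) < min_count for u in decomp):
        return float("-inf")
    return -len(decomp)

# Training-corpus occurrence counts for seen units (invented numbers).
counts = {("s",): 500, ("t",): 400, ("r",): 300,
          ("ih",): 450, ("ng",): 350,
          ("s", "t"): 120, ("s", "t", "r"): 15, ("ih", "ng"): 80}
seen = set(counts)

unseen_syllable = ["s", "t", "r", "ih", "ng"]
best = max(decompositions(unseen_syllable, seen),
           key=lambda d: figure_of_merit(d, counts))
print(best)  # the longest well-trained units that cover the syllable
```

Here ("s", "t", "r") is rejected for having too few training examples, so the search falls back to the next-longest covering units; the actual figure of merit in the paper plays this same arbitration role between unit length and training reliability.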
Bibliographic reference. Jouvet, Denis / Messina, Ronaldo (2004): "Context dependent "long units" for speech recognition", In INTERSPEECH-2004, 645-648.