INTERSPEECH 2006 - ICSLP
Longer-sized sub-word units are known to be better candidates for the development of a continuous speech recognition system. However, the basic problem with such units is data sparsity. To overcome this problem, researchers have tried to combine longer-sized sub-word unit models with phoneme models. In this paper, we consider only frequently occurring syllables and VC (vowel + consonant) units, together with phone-sized units (monophones and triphones), for the development of a continuous speech recognition system. In such a case, even for a single pronunciation of a word, there can be multiple representational baseforms in the lexicon, each composed of different-sized units. We show that a considerable improvement in recognition performance can be achieved if the baseforms are selected properly. Out of all possible baseforms for a given word in the lexicon, only the baseform that maximizes the acoustic likelihood, over all possible sub-word unit concatenations that form the word, is retained. Since, in the word lexicon of a baseline system (e.g., a pure monophone- or triphone-based system), only the acoustically weaker baseforms are replaced by baseforms with longer-sized units, the resulting performance is guaranteed to be at least that of the baseline system. Preliminary experiments carried out on the TIMIT speech corpus show a considerable improvement in recognition performance over pure monophone/triphone-based systems when the larger-sized units are combined using proper selection of baseforms.
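The selection step described above, keeping only the maximum-likelihood baseform among the candidate sub-word-unit concatenations for each word, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the candidate baseforms and the scoring function `toy_score` are hypothetical stand-ins for a real acoustic-likelihood computation over a trained model set.

```python
from typing import Callable, List

def select_baseform(baseforms: List[List[str]],
                    score: Callable[[List[str]], float]) -> List[str]:
    """Return the candidate baseform (a sequence of sub-word units)
    with the highest acoustic score for a given word."""
    return max(baseforms, key=score)

# Hypothetical scorer standing in for an acoustic log-likelihood.
# Here it simply favors baseforms built from fewer, longer units;
# a real system would score each concatenation against its models.
def toy_score(baseform: List[str]) -> float:
    return -len(baseform)

# Two candidate baseforms for the same word: one phone-sized,
# one built from longer (syllable-like) units.
candidates = [
    ["p", "er", "f", "orm"],   # phone-sized units
    ["per", "form"],           # longer-sized units
]

best = select_baseform(candidates, toy_score)
print(best)  # under the toy scorer, the longer-unit baseform wins
```

In a full system, only words whose longer-unit baseform scores higher than the phone-based one would have their lexicon entry replaced, which is what guarantees the performance is no worse than the baseline.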
Bibliographic reference. Nagarajan, T. / Vijayalakshmi, P. / O'Shaughnessy, Douglas (2006): "Combining multiple-sized sub-word units in a speech recognition system using baseform selection", In INTERSPEECH-2006, paper 1280-Wed1BuP.12.