Though the syllable has been touted as an excellent candidate unit for acoustic modeling, past work has yet to demonstrate its value. One reason is that critical factors such as context-dependency and model clustering are typically neglected in syllable-based work. This paper presents fragmented syllable models, a means of realizing context-dependency for the syllable while constraining the resulting explosion in training data requirements. Fragmented syllables expose only their head and tail phones as context, which limits the context space for triphone expansion. Furthermore, decision-tree clustering can be used to share data between parts, or fragments, of syllables, making better use of the training data available for data-sparse syllables. The best resulting system achieves a 1.8% absolute (5.4% relative) reduction in word error rate (WER) over a baseline triphone acoustic model on a Switchboard-1 conversational telephone speech task.
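The abstract reports the improvement both as an absolute and as a relative WER reduction but does not state the baseline WER itself. As a rough sanity check, the two figures can be combined to estimate it, since relative reduction = absolute reduction / baseline WER. The sketch below is only illustrative: the function name `implied_baseline_wer` is hypothetical, and because both reported figures are rounded, the result is an approximation (roughly 33%), not a number taken from the paper.

```python
def implied_baseline_wer(absolute_reduction: float, relative_reduction: float) -> float:
    """Estimate the baseline WER (in percent) from the two reported reductions.

    relative = absolute / baseline  =>  baseline = absolute / relative
    `absolute_reduction` is in percentage points; `relative_reduction` is a fraction.
    """
    if relative_reduction <= 0:
        raise ValueError("relative_reduction must be positive")
    return absolute_reduction / relative_reduction


# Figures reported in the abstract: 1.8% absolute, 5.4% relative.
baseline = implied_baseline_wer(1.8, 0.054)
# Assumption: the improved system's WER is the baseline minus the absolute reduction.
improved = baseline - 1.8

print(f"implied baseline WER ~ {baseline:.1f}%")
print(f"implied improved WER ~ {improved:.1f}%")
```

Under these assumptions the triphone baseline sits at about one word error in three, which is consistent with the difficulty of conversational telephone speech.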
Bibliographic reference. Thambiratnam, K. / Seide, Frank (2008): "Fragmented context-dependent syllable acoustic models", In INTERSPEECH-2008, 2418-2421.