Recently, the use of phoneme class-conditional probabilities as features (posterior features) for template-based ASR has been proposed. These features have been found to generalize well to unseen data and to yield better systems than standard spectral-based features. In this paper, motivated by the high quality of current text-to-speech systems and the robustness of posterior features to undesired variability, we investigate the use of synthetic speech to generate reference templates. Using synthetic speech in template-based ASR addresses not only the issue of in-domain data collection but also that of vocabulary expansion. Using 75- and 600-word task-independent and speaker-independent setups on the Phonebook database, we investigate different synthetic voices produced by the Festival HTS-based synthesizer trained on the CMU ARCTIC databases. Our study shows that synthetic speech templates can yield performance comparable to that of natural speech templates, especially with synthetic voices that have high intelligibility.
Index Terms: Speech recognition, template-based approach, posterior features, synthetic reference templates
Bibliographic reference. Soldo, Serena / Magimai-Doss, Mathew / Bourlard, Hervé (2012): "Synthetic references for template-based ASR using posterior features", In INTERSPEECH-2012, 2146-2149.