ISCA Archive SAPA 2012

Template-based ASR using posterior features and synthetic references: comparing different TTS systems

Serena Soldo, Mathew Magimai-Doss, Hervé Bourlard

Recent work has shown that phone class-conditional posterior probabilities (posterior features), used directly as features, yield good results in template-based ASR systems. In this paper, motivated by the high quality of current text-to-speech systems and the robustness of posterior features to undesired variability, we investigate the use of synthetic speech to generate reference templates. Using synthetic speech in template-based ASR addresses not only the issue of in-domain data collection but also the expansion of the vocabulary. On 75- and 600-word task-independent and speaker-independent setups of the Phonebook corpus, we show the feasibility of this approach by investigating different synthetic voices produced by an HTS-based synthesizer trained on two different databases. Our study shows that synthetic speech templates can yield performance comparable to natural speech templates, especially with synthetic voices that have high intelligibility.

Index Terms: Speech recognition, template-based approach, posterior features, synthetic reference templates


Cite as: Soldo, S., Magimai-Doss, M., Bourlard, H. (2012) Template-based ASR using posterior features and synthetic references: comparing different TTS systems. Proc. SAPA-SCALE conference (SAPA 2012), 52-57

@inproceedings{soldo12_sapa,
  author={Serena Soldo and Mathew Magimai-Doss and Hervé Bourlard},
  title={{Template-based ASR using posterior features and synthetic references: comparing different TTS systems}},
  year=2012,
  booktitle={Proc. SAPA-SCALE conference (SAPA 2012)},
  pages={52--57}
}