Toward the automatic creation of web-based voice fonts at low cost, automatic speech transcription is used to obtain the linguistic features for building HMM-based voices from web audio content. This paper presents an investigation of the influence of erroneous transcripts on such voices. We simulate varied transcript errors by using a large-vocabulary automatic speech recognizer (LVASR) to dictate thousands of Japanese utterances from two speakers (one male and one female). A set of experiments is conducted on dozens of HMM voices built from both dictated and correct transcripts. The results indicate that transcript errors have a significant impact on the voices. One direct impact is an increase in the number of leaf nodes of the decision trees associated with state duration and F0, but a decrease in those associated with cepstrum, in comparison with reference voices built from correct transcripts. HMM voice quality in terms of mean opinion score (MOS) is closely related to the word and phone accuracy of the transcripts: to achieve fair voice quality with limited training samples, for example, word and phone accuracy must exceed 50% and 80%, respectively.
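The word accuracy figure that the abstract correlates with MOS is conventionally computed as 1 − WER, from a minimum edit-distance alignment of the recognizer output against a reference transcript. A minimal sketch (an illustration, not code from the paper) of that standard computation:

```python
def word_accuracy(reference, hypothesis):
    """Return word accuracy of `hypothesis` against `reference`.

    Both arguments are lists of words. Accuracy = 1 - WER, where WER
    counts substitutions, deletions, and insertions in the minimum
    edit-distance alignment, normalized by the reference length.
    """
    n, m = len(reference), len(hypothesis)
    # d[i][j] = edit distance between reference[:i] and hypothesis[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # i deletions
    for j in range(m + 1):
        d[0][j] = j  # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match/substitution
    return 1.0 - d[n][m] / n


# One substitution (sat->sit) and one deletion (the) against a
# 6-word reference give accuracy 1 - 2/6.
print(word_accuracy("the cat sat on the mat".split(),
                    "the cat sit on mat".split()))
```

Phone accuracy is computed the same way, with phone labels in place of words; the 50% word / 80% phone thresholds quoted above are thresholds on exactly this quantity.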
Index Terms: HMM-based speech synthesis, web-based voicefonts, unsupervised approach, HTS
Cite as: Ni, J., Kawai, H. (2010) An investigation of the impact of speech transcript errors on HMM voices. Proc. 7th ISCA Workshop on Speech Synthesis (SSW 7), 246-251
@inproceedings{ni10_ssw,
  author={Jinfu Ni and Hisashi Kawai},
  title={{An investigation of the impact of speech transcript errors on HMM voices}},
  year=2010,
  booktitle={Proc. 7th ISCA Workshop on Speech Synthesis (SSW 7)},
  pages={246--251}
}