This paper presents an approach to the rapid, low-cost creation of varied synthetic voices. It consists of amassing audio content from the web, extracting usable speech, transcribing that speech to surface text, performing phone-time alignment, and using the resulting speech and transcripts to build HMM-based voices. A set of experiments was conducted to evaluate this approach. The results indicate that large volumes of audio content are available on the internet, although more than 33.3% of web radio data are unusable for building voices due to noise, music, and overlapping speakers. Among the 14 voices built from limited radio monologues in Japanese, three were fair (the middle of a five-point scale) but two were bad (the lowest level). The influence of erroneous transcripts on voice quality is significant: to achieve fair voice quality with limited speech data, the phone and word accuracy of the speech transcriptions must be higher than 80% and 50%, respectively.
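As a minimal sketch of the transcript-selection criterion implied by the reported thresholds (the function name and interface here are hypothetical illustrations, not the authors' code):

```python
# Illustrative sketch: gate automatically produced transcripts on the
# accuracy thresholds reported in the abstract (phone > 80%, word > 50%).
PHONE_ACC_MIN = 0.80  # minimum phone accuracy for fair voice quality
WORD_ACC_MIN = 0.50   # minimum word accuracy for fair voice quality

def transcript_usable(phone_accuracy: float, word_accuracy: float) -> bool:
    """Return True if an ASR transcript meets both accuracy thresholds."""
    return phone_accuracy > PHONE_ACC_MIN and word_accuracy > WORD_ACC_MIN

print(transcript_usable(0.85, 0.60))  # True: both thresholds met
print(transcript_usable(0.75, 0.60))  # False: phone accuracy too low
```

In practice such a gate would be applied per utterance after ASR, keeping only segments whose transcripts are reliable enough for HMM voice training.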
Bibliographic reference. Ni, Jinfu / Kawai, Hisashi (2010): "An unsupervised approach to creating web audio contents-based HMM voices", In INTERSPEECH-2010, 849-852.