Interspeech'2005 - Eurospeech
The goal of producing a corpus-based synthesizer with the owner's voice can only be achieved if the system can handle recordings with less than ideal characteristics. One of the limitations is that a normal speaker does not always pronounce a word exactly as predicted by the language rules. In this work we compare two methods for handling variations in word pronunciation in corpus-based speech synthesizers. Both approaches rely on a speech corpus aligned with a phone-level segmentation tool that allows alternative word pronunciations. The first approach performs an alignment between the observed pronunciation and the canonical form used in the system's lexicon, allowing the time labels of the observed phones to be mapped onto the canonical form. At synthesis time, unit selection is then performed on the phone sequence predicted by the system. In the second approach, the phone sequence generated by the segmentation tool is left unmodified; at synthesis time, words are converted into phones using the speaker's own pronunciations rather than the system's lexicon. Finally, the two approaches are compared by evaluating the naturalness of the signals generated by each.
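The first approach hinges on aligning an observed phone sequence with the canonical lexicon form so that time labels can be carried over. The paper does not specify the alignment algorithm, but the step can be sketched with a standard unit-cost edit-distance alignment; the phone symbols and time values below are illustrative examples, not data from the paper.

```python
def align(canonical, observed):
    """Align two phone sequences with unit-cost edit distance.

    Returns a list of (canonical_index, observed_index) pairs,
    where None marks a phone present in only one sequence
    (an insertion or deletion).
    """
    n, m = len(canonical), len(observed)
    # Dynamic-programming table of edit costs.
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (canonical[i - 1] != observed[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace to recover the aligned index pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1]
                + (canonical[i - 1] != observed[j - 1])):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((i - 1, None))   # canonical phone not realized
            i -= 1
        else:
            pairs.append((None, j - 1))   # extra phone in the recording
            j -= 1
    return list(reversed(pairs))


def map_labels(canonical, observed, end_times):
    """Copy each observed phone's time label onto its aligned
    canonical phone; unrealized canonical phones get None."""
    labels = [None] * len(canonical)
    for ci, oi in align(canonical, observed):
        if ci is not None and oi is not None:
            labels[ci] = end_times[oi]
    return labels


# Hypothetical example: a speaker drops the final /t/ of a word whose
# canonical pronunciation is /k ae n t/.
canonical = ["k", "ae", "n", "t"]
observed = ["k", "ae", "n"]
print(align(canonical, observed))
# → [(0, 0), (1, 1), (2, 2), (3, None)]
print(map_labels(canonical, observed, [0.10, 0.18, 0.25]))
# → [0.1, 0.18, 0.25, None]
```

Canonical phones with no observed counterpart end up without a time label, which is exactly the case the paper's first approach must handle when the unit-selection search runs over the system-predicted phone sequence.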
Bibliographic reference. Paulo, Sérgio / Oliveira, Luís C. (2005): "Reducing the corpus-based TTS signal degradation due to speaker's word pronunciations", In INTERSPEECH-2005, 1089-1092.