In conventional concept-to-speech (CTS) methods, a common step is to predict abstract prosodic descriptions, such as the locations of accents and phrase boundaries, from the linguistic information provided by the text generation module. However, these predictions inevitably contain errors, and an unacceptable prosodic prediction can ruin the synthesized speech. In addition, the linguistic information, which text generation can supply conveniently and accurately, has not been used directly in the acoustic modelling and speech generation stages of CTS. This paper presents a CTS method that combines HMM-based speech synthesis (HTS) with a text generation module called Komet-Penman Multilingual (KPML). In this method, syntagmatic features derived from the linguistic information given by KPML are added directly to the context features used for context-dependent HMM modelling. Further, prosodic features are discarded during acoustic modelling, which avoids both costly prosodic annotation of the training waveforms and inaccurate prosodic prediction at synthesis time. Experiments show that the proposed method performs no worse than the conventional method with automatic prosodic prediction. When manual prosodic annotation of the training corpus is unavailable, the proposed method performs better than the conventional method.
Bibliographic reference. Wang, Xin / Ling, Zhen-Hua / Dai, Li-Rong (2014): "Concept-to-speech generation by integrating syntagmatic features into HMM-based speech synthesis", In INTERSPEECH-2014, 2942-2946.