This paper proposes an automatic prosodic labeling technique for constructing speech databases used in speech synthesis. In corpus-based Japanese speech synthesis, it is essential to use speech data annotated with prosodic information such as phrase boundaries and accent types. However, manual annotation is generally time-consuming and expensive. To overcome this problem, we propose a technique for estimating accent types and phrase boundaries from a speech waveform and its transcribed text using both language and acoustic models. We use a conditional random field (CRF) for the language model and a hidden Markov model (HMM) for the acoustic model; HMMs have been shown to be effective for prosody modeling in speech synthesis. Introducing the HMM allows the continuously changing features of F0 contours to be modeled well, which results in higher estimation accuracy than conventional techniques that use a simple polygonal line approximation of F0 contours.
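As an illustration of how scores from a language model and an acoustic model might be used together, the following minimal sketch picks the accent-type candidate with the best combined log score. It is not the method of the paper: the log-linear combination, the `lm_weight` parameter, the function name `pick_accent_type`, and the candidate labels and score values are all assumptions made for this example.

```python
from typing import Dict


def pick_accent_type(
    lm_logprob: Dict[str, float],
    am_loglik: Dict[str, float],
    lm_weight: float = 1.0,
) -> str:
    """Return the candidate label with the highest combined log score.

    lm_logprob: hypothetical language-model log-probabilities per candidate
                (for example, taken from a CRF over the transcribed text).
    am_loglik:  hypothetical acoustic log-likelihoods per candidate
                (for example, taken from an HMM over F0 features).
    lm_weight:  assumed scaling factor for the language-model score.
    """
    # Only candidates scored by both models can be compared.
    candidates = sorted(lm_logprob.keys() & am_loglik.keys())
    if not candidates:
        raise ValueError("no candidate is scored by both models")

    def combined(label: str) -> float:
        # Assumed log-linear combination of the two scores.
        return lm_weight * lm_logprob[label] + am_loglik[label]

    return max(candidates, key=combined)


# Illustrative values only; they are not taken from the paper.
lm_scores = {"type0": -0.5, "type1": -1.2, "type2": -3.0}
am_scores = {"type0": -42.0, "type1": -40.0, "type2": -45.0}

best_equal_weight = pick_accent_type(lm_scores, am_scores)
best_lm_heavy = pick_accent_type(lm_scores, am_scores, lm_weight=5.0)
```

With these made-up scores, the acoustic evidence decides the result at equal weighting, while a larger `lm_weight` lets the language-model preference win; in practice such a weight would need to be tuned on held-out labeled data.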
Bibliographic reference. Koriyama, Tomoki / Suzuki, Hiroshi / Nose, Takashi / Shinozaki, Takahiro / Kobayashi, Takao (2014): "Accent type and phrase boundary estimation using acoustic and language models for automatic prosodic labeling", In INTERSPEECH-2014, 2337-2341.