Much remains unsolved in how to predict prosody from text for unlimited Mandarin Chinese TTS. The interactions between syntactic structure and prosodic structure, and the ways in which one governs the other, remain unresolved challenges. Using part-of-speech (henceforth POS) tagging to obtain the lexical information of the text, we aimed to find significant patterns of word grouping by analyzing real speech data together with that lexical information. This paper reports discrepancies found between lexical words (henceforth LW) parsed from text and prosodic words (henceforth PW) annotated from speech data, and proposes a statistical model to predict PWs from LWs. In the statistical model, word length and POS tag are the two essential features for predicting PWs. The results showed a prediction rate of approximately 90% for PWs, though they leave room for extension. We believe that evidence from PW predictions is a first step towards building prosody models from text.

1. INTRODUCTION

Much remains unsolved in how to predict prosody from text for unlimited Mandarin Chinese TTS. Linguistic analyses of text have been insufficient to provide the specifications required for speech prosody, both in terms of prosodic units and boundaries and in terms of intonation contours for connected fluent speech. Though syntactic analyses provide possible boundaries and intonation specifications for phrases, the locations of boundaries and breaks in connected speech require more specification, and the prosody of fluent speech goes beyond concatenating simple-sentence intonations into strings. Aiming to build a prosody model for connected fluent speech from the bottom up, our first step was to set up models that could sufficiently predict PWs from LWs and serve as a base for building speech prosody. In hierarchical rhythmic structures [1], the PW is the fundamental prosodic unit, while the LW is the basic syntactic unit in syntactic structure. However, there were gaps and discrepancies between the syntactic and prosodic structures at each layer.
Only 67.5% of PWs and LWs coincided in our prosodic-structure-tagged corpora (see Section 2.3). In this paper we propose a statistical model for predicting PWs by grouping lexical words. The issue of grouping words to form PWs has been studied in [2, 3]; a good word grouping strategy helps construct the temporal organization of speech and renders spoken utterances natural and fluent. In the following sections, we focus on finding an optimal word grouping strategy by combining lexical information
Cite as: Peng, H., Chen, C., Tseng, C., Chen, K. (2004) Predicting Prosodic Words from Lexical Words--A First Step Towards Predicting Prosody from Text. Proc. International Symposium on Chinese Spoken Language Processing, 173-176
@inproceedings{peng04_iscslp, author={HuaJui Peng and Chiching Chen and Chiuyu Tseng and Kehjiann Chen}, title={{Predicting Prosodic Words from Lexical Words--A First Step Towards Predicting Prosody from Text}}, year=2004, booktitle={Proc. International Symposium on Chinese Spoken Language Processing}, pages={173--176} }