In Japanese, every content word has its own H/L (high/low) pitch pattern when uttered in isolation, called its accent type. In a TTS system, this lexical information is usually stored in a dictionary and referred to during prosody generation. When a written sentence is converted to speech, however, the lexical H/L pattern often changes according to context, a phenomenon known as word accent sandhi. This accent change is troublesome for speech synthesis researchers because even native speakers find it difficult to describe explicitly the mechanism behind the change, although young Japanese speakers learn it without trouble. To develop a good Japanese TTS system, this implicit phonological knowledge has to be built into the system. In our previous study [1], we developed a rule-based module for accent sandhi, but it produced a non-negligible number of errors. This paper describes the development of a corpus-based module that uses Conditional Random Fields (CRFs) to predict the change. The new module predicts the change more accurately than the previous rule-based module, and it is tuned further by integrating the rule-based knowledge acquired in the previous study.
[1] N. Minematsu, R. Kita, and K. Hirose (2003), "Automatic estimation of accentual attribute values of words for accent sandhi rules of Japanese text-to-speech conversion," Trans. IEICE, vol. E86-D, no. 3, pp. 550-557.
Cite as: Minematsu, N., Kuroiwa, R., Hirose, K., Watanabe, M. (2007) CRF-based statistical learning of Japanese accent sandhi for developing Japanese text-to-speech synthesis systems. Proc. 6th ISCA Workshop on Speech Synthesis (SSW 6), 148-153
@inproceedings{minematsu07_ssw, author={Nobuaki Minematsu and Ryo Kuroiwa and Keikichi Hirose and Michiko Watanabe}, title={{CRF-based statistical learning of Japanese accent sandhi for developing Japanese text-to-speech synthesis systems}}, year=2007, booktitle={Proc. 6th ISCA Workshop on Speech Synthesis (SSW 6)}, pages={148--153} }