We present a new stochastic approach for accurately estimating phonemes and accents in Japanese text-to-speech (TTS) systems. The front-end process of a TTS system assigns phonemes and accents to plain input text, a step that is critical for producing intelligible and natural speech. Rule-based approaches that build hierarchical structures are widely used for this purpose, but they have well-known limitations in scalability and ease of domain adaptation. In this paper, we present a stochastic method based on an n-gram model for phoneme and accent estimation. The proposed method estimates not only phonemes and accents but also word segmentation and part-of-speech (POS) tags simultaneously. We implemented a system for Japanese that solves tokenization, linguistic annotation, text-to-phoneme conversion, homograph disambiguation, and accent generation at the same time, and observed promising results.
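As a rough illustration of joint estimation with an n-gram model, the sketch below treats each candidate token as a single unit carrying its surface form, POS, phonemes, and accent, and searches for the best-scoring sequence of units with a bigram model. This is a minimal sketch under stated assumptions, not the system described in the paper: the names `Unit`, `LEXICON`, `TABLE`, and `decode`, the toy lexicon, the accent tags, the probabilities, and the floor value for unseen bigrams are all hypothetical.

```python
import math
from collections import namedtuple

# One unit bundles everything the front end must output for a token.
Unit = namedtuple("Unit", "surface pos phonemes accent")

BOS = Unit("<s>", "BOS", "", "")  # sentence-start marker

# Hypothetical toy lexicon: surface string -> candidate units.
# Accent tags (A0, A1, A2) are arbitrary placeholders, not verified patterns.
LEXICON = {
    "今日": [Unit("今日", "noun", "kyo:", "A1"),
             Unit("今日", "noun", "koNnichi", "A2")],
    "は": [Unit("は", "particle", "wa", "A0")],
    "今": [Unit("今", "noun", "ima", "A1")],
    "日": [Unit("日", "noun", "hi", "A0")],
}

KYO, KONNICHI = LEXICON["今日"]
WA = LEXICON["は"][0]

# Hypothetical bigram probabilities P(unit | previous unit); made-up numbers.
TABLE = {
    (BOS, KYO): 0.6,
    (BOS, KONNICHI): 0.1,
    (KYO, WA): 0.5,
    (KONNICHI, WA): 0.2,
}


def bigram_logprob(prev, cur, table, floor=1e-6):
    """Log-probability of cur given prev; unseen pairs get a small floor."""
    return math.log(table.get((prev, cur), floor))


def decode(text, lexicon, table):
    """Viterbi search over all segmentations and unit choices of text."""
    # chart[i] maps the last unit ending at position i to
    # (best score, start position of that unit, previous unit).
    chart = [dict() for _ in range(len(text) + 1)]
    chart[0][BOS] = (0.0, None, None)
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            for unit in lexicon.get(text[start:end], []):
                for prev, (score, _, _) in chart[start].items():
                    cand = score + bigram_logprob(prev, unit, table)
                    if unit not in chart[end] or cand > chart[end][unit][0]:
                        chart[end][unit] = (cand, start, prev)
    if not chart[-1]:
        raise ValueError("no segmentation covers the input")
    # Trace back from the best final unit to the sentence start.
    unit = max(chart[-1], key=lambda u: chart[-1][u][0])
    pos, path = len(text), []
    while unit != BOS:
        path.append(unit)
        _, pos, unit = chart[pos][unit]
    return path[::-1]


result = decode("今日は", LEXICON, TABLE)
for u in result:
    print(u.surface, u.pos, u.phonemes, u.accent)
```

Because segmentation, POS, reading, and accent are all attributes of the same unit, a single search resolves them together; choosing between the two readings of the homograph in the toy lexicon falls out of the same scoring that picks the word boundaries.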
Cite as: Nagano, T., Mori, S., Nishimura, M. (2005) A stochastic approach to phoneme and accent estimation. Proc. Interspeech 2005, 3293-3296, doi: 10.21437/Interspeech.2005-575
@inproceedings{nagano05_interspeech,
  author={Tohru Nagano and Shinsuke Mori and Masafumi Nishimura},
  title={{A stochastic approach to phoneme and accent estimation}},
  year=2005,
  booktitle={Proc. Interspeech 2005},
  pages={3293--3296},
  doi={10.21437/Interspeech.2005-575}
}