Current speech synthesis efforts, both in research and in applications, are dominated by methods based on the concatenation of spoken units. Further progress in concatenative text-to-speech (TTS) technology can be made mainly in two directions: reducing the memory footprint so that the system can be integrated into embedded systems, or improving the quality of the synthesized speech in terms of intelligibility and naturalness. In this paper, we focus on memory footprint reduction in a Mandarin TTS system. We show that significant memory reductions can be achieved through duration modeling and memory optimization of the lexicon data. The experimental results indicate that the memory requirements of the duration data and the lexicon can be significantly reduced while keeping the speech quality unaffected. For practical embedded implementations, this is a significant step towards an efficient TTS engine. The applicability of the approach is verified in the speech synthesis system.
Cite as: Tian, J., Nurminen, J., Kiss, I. (2005) Duration modeling and memory optimization in a Mandarin TTS system. Proc. Interspeech 2005, 1929-1932, doi: 10.21437/Interspeech.2005-604
@inproceedings{tian05_interspeech,
  author={Jilei Tian and Jani Nurminen and Imre Kiss},
  title={{Duration modeling and memory optimization in a Mandarin TTS system}},
  year=2005,
  booktitle={Proc. Interspeech 2005},
  pages={1929--1932},
  doi={10.21437/Interspeech.2005-604}
}