Pause insertion prediction using evaluation model of perceptual pause insertion naturalness

Hiroko Muto, Yusuke Ijima, Noboru Miyazaki, Hideyuki Mizuno


This paper describes a pause insertion prediction technique that makes the synthesized speech of text-to-speech (TTS) synthesis systems sound more natural. The novelty of the proposed technique is that it combines a prediction model based on machine learning with an evaluation model of perceptual pause insertion naturalness. The evaluation model represents the relationship between several features related to pause insertion and the perceptual pause insertion naturalness obtained in a subjective evaluation. First, using the prediction model, we obtain the N-best sequences, each of which indicates whether or not a pause is present at each phrase boundary. We then estimate a pause insertion naturalness score for each N-best sequence using the evaluation model and select the sequence with the highest score. Objective and subjective evaluation results show that the proposed technique gives better results than a conventional technique.
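The two-stage procedure in the abstract (N-best generation followed by rescoring) can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the prediction model is reduced to independent per-boundary pause probabilities, the evaluation model is a hypothetical linear function of two made-up features (number of pauses and longest pause-free run in morae), and the weights and input values are invented.

```python
import heapq
import itertools
import math


def n_best_sequences(pause_probs, n):
    """Return the n most probable pause sequences as (log_prob, sequence) pairs.

    Assumption: boundaries are independent and each probability lies strictly
    between 0 and 1. Exhaustive enumeration is only practical for short inputs.
    """
    scored = []
    for seq in itertools.product((0, 1), repeat=len(pause_probs)):
        logp = sum(math.log(p if s else 1.0 - p) for s, p in zip(seq, pause_probs))
        scored.append((logp, seq))
    return heapq.nlargest(n, scored)


def naturalness_score(seq, phrase_lengths, weights):
    """Hypothetical linear evaluation model over two illustrative features."""
    # Accumulate the length (in morae) of each pause-free run of phrases.
    runs, current = [], phrase_lengths[0]
    for has_pause, length in zip(seq, phrase_lengths[1:]):
        if has_pause:
            runs.append(current)
            current = length
        else:
            current += length
    runs.append(current)
    features = {
        "bias": 1.0,
        "num_pauses": float(sum(seq)),
        "longest_run": float(max(runs)),
    }
    return sum(weights[name] * value for name, value in features.items())


def select_pause_sequence(pause_probs, phrase_lengths, weights, n=4):
    """Rescore the n-best candidates and keep the most natural one."""
    candidates = n_best_sequences(pause_probs, n)
    return max(
        (seq for _, seq in candidates),
        key=lambda seq: naturalness_score(seq, phrase_lengths, weights),
    )


# Invented example: four phrases, hence three phrase boundaries.
pause_probs = [0.4, 0.45, 0.2]
phrase_lengths = [6, 7, 5, 4]
weights = {"bias": 5.0, "num_pauses": -0.5, "longest_run": -0.2}

print(n_best_sequences(pause_probs, 1)[0][1])                       # (0, 0, 0)
print(select_pause_sequence(pause_probs, phrase_lengths, weights))  # (1, 1, 0)
```

In this toy setting the most probable sequence under the prediction model inserts no pauses, while rescoring prefers a lower-ranked candidate that breaks up the long pause-free run. This is the kind of correction that rescoring with an evaluation model is meant to allow.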


DOI: 10.21437/SpeechProsody.2014-99

Cite as: Muto, H., Ijima, Y., Miyazaki, N., Mizuno, H. (2014) Pause insertion prediction using evaluation model of perceptual pause insertion naturalness. Proc. 7th International Conference on Speech Prosody 2014, 558-562, DOI: 10.21437/SpeechProsody.2014-99.


@inproceedings{Muto2014,
  author={Hiroko Muto and Yusuke Ijima and Noboru Miyazaki and Hideyuki Mizuno},
  title={{Pause insertion prediction using evaluation model of perceptual pause insertion naturalness}},
  year=2014,
  booktitle={Proc. 7th International Conference on Speech Prosody 2014},
  pages={558--562},
  doi={10.21437/SpeechProsody.2014-99},
  url={http://dx.doi.org/10.21437/SpeechProsody.2014-99}
}