Phrase Break Prediction for Long-Form Reading TTS: Exploiting Text Structure Information

Viacheslav Klimkov, Adam Nadolski, Alexis Moinet, Bartosz Putrycz, Roberto Barra-Chicote, Thomas Merritt, Thomas Drugman


Phrasing structure is one of the most important factors in increasing the naturalness of text-to-speech (TTS) systems, particularly for long-form reading. Most existing TTS systems are optimized for isolated short sentences and completely discard the larger context or structure of the text.

This paper describes how we built phrasing models from data extracted from audiobooks. We investigate how various types of textual features can improve phrase break prediction: part-of-speech (POS), guess POS (GPOS), dependency tree features and word embeddings. These features are fed into a bidirectional LSTM (BiLSTM) or a CART baseline. The resulting systems are compared using both objective and subjective evaluations. Using a BiLSTM with word embeddings proves beneficial.


DOI: 10.21437/Interspeech.2017-419

Cite as: Klimkov, V., Nadolski, A., Moinet, A., Putrycz, B., Barra-Chicote, R., Merritt, T., Drugman, T. (2017) Phrase Break Prediction for Long-Form Reading TTS: Exploiting Text Structure Information. Proc. Interspeech 2017, 1064-1068, DOI: 10.21437/Interspeech.2017-419.


@inproceedings{Klimkov2017,
  author={Viacheslav Klimkov and Adam Nadolski and Alexis Moinet and Bartosz Putrycz and Roberto Barra-Chicote and Thomas Merritt and Thomas Drugman},
  title={Phrase Break Prediction for Long-Form Reading TTS: Exploiting Text Structure Information},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1064--1068},
  doi={10.21437/Interspeech.2017-419},
  url={http://dx.doi.org/10.21437/Interspeech.2017-419}
}