ISCA Archive SSW 2023

Local Style Tokens: Fine-Grained Prosodic Representations For TTS Expressive Control

Martin Lenglet, Olivier Perrotin, Gérard Bailly

Neural Text-To-Speech (TTS) models achieve high naturalness, but modeling expressivity remains an ongoing challenge. Implicit approaches such as Global Style Tokens (GST) have had some success, but these methods model speech style at the utterance level. In this paper, we propose adding an auxiliary module called Local Style Tokens (LST) to the encoder-decoder pipeline to model local variations in prosody. This module can operate at various scales of representation; we chose word-level and phoneme-level prosodic representations to assess how well the proposed module models sub-utterance style variations. Objective evaluation of the synthetic speech shows that LST modules capture prosodic variations on 12 common styles better than a GST baseline. Listening tests with participants confirmed these results.


doi: 10.21437/SSW.2023-19

Cite as: Lenglet, M., Perrotin, O., Bailly, G. (2023) Local Style Tokens: Fine-Grained Prosodic Representations For TTS Expressive Control. Proc. 12th ISCA Speech Synthesis Workshop (SSW2023), 120-126, doi: 10.21437/SSW.2023-19

@inproceedings{lenglet23_ssw,
  author={Martin Lenglet and Olivier Perrotin and Gérard Bailly},
  title={{Local Style Tokens: Fine-Grained Prosodic Representations For TTS Expressive Control}},
  year=2023,
  booktitle={Proc. 12th ISCA Speech Synthesis Workshop (SSW2023)},
  pages={120--126},
  doi={10.21437/SSW.2023-19}
}