We present a scalable method to produce high-quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasized word. We show that this is significantly better than spectrogram modification techniques, improving naturalness by 7.3% and testers’ correct identification of the emphasized word in a sentence by 40% on a reference female en-US voice. We show that this technique significantly closes the gap to methods that require explicit recordings. The method proved to be scalable and preferred in all four languages tested (English, Spanish, Italian, German), for different voices and multiple speaking styles.
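The core idea, lengthening the predicted phoneme durations of the emphasized word, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `emphasize_durations`, the `word_spans` representation, and the default `scale` value are all hypothetical choices made for illustration, and the abstract does not state which scaling factor was used.

```python
from typing import List, Sequence, Tuple


def emphasize_durations(
    durations: Sequence[float],
    word_spans: Sequence[Tuple[int, int]],
    word_index: int,
    scale: float = 1.25,  # assumed value; the abstract gives no factor
) -> List[float]:
    """Scale the predicted durations of the phonemes in one word.

    durations:  per-phoneme durations from a duration model (e.g. in frames).
    word_spans: (start, end) phoneme index range of each word, end exclusive.
    word_index: index of the word to emphasize.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    start, end = word_spans[word_index]
    # Only phonemes inside the emphasized word are lengthened.
    return [d * scale if start <= i < end else d for i, d in enumerate(durations)]


# Hypothetical example: three words covering six phonemes.
predicted = [5.0, 7.0, 6.0, 4.0, 8.0, 3.0]
spans = [(0, 2), (2, 5), (5, 6)]
emphasized = emphasize_durations(predicted, spans, word_index=1, scale=1.5)
```

In a real system the modified durations would then be passed to the acoustic model in place of the original predictions; the rest of the pipeline is unchanged, which is why no emphasis recordings or annotations are needed.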
Cite as: Joly, A., Nicolis, M., Peterova, E., Lombardi, A., Abbas, A., Korlaar, A.v., Hussain, A., Sharma, P., Moinet, A., Lajszczak, M., Karanasou, P., Bonafonte, A., Drugman, T., Sokolova, E. (2023) Controllable Emphasis with zero data for text-to-speech. Proc. 12th ISCA Speech Synthesis Workshop (SSW2023), 113-119, doi: 10.21437/SSW.2023-18
@inproceedings{joly23_ssw,
  author={Arnaud Joly and Marco Nicolis and Ekaterina Peterova and Alessandro Lombardi and Ammar Abbas and Arent van Korlaar and Aman Hussain and Parul Sharma and Alexis Moinet and Mateusz Lajszczak and Penny Karanasou and Antonio Bonafonte and Thomas Drugman and Elena Sokolova},
  title={{Controllable Emphasis with zero data for text-to-speech}},
  year=2023,
  booktitle={Proc. 12th ISCA Speech Synthesis Workshop (SSW2023)},
  pages={113--119},
  doi={10.21437/SSW.2023-18}
}