ISCA Archive SSW 2023

Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data

Jarod Duret, Yannick Estève, Titouan Parcollet

We propose a method for speech-to-speech emotion-preserving translation that operates at the level of discrete speech units. Our approach relies on a multilingual emotion embedding that captures affective information in a language-independent manner. We show that this embedding can be used to predict the pitch and duration of speech units in a target language, allowing us to resynthesize the source speech signal with the same emotional content. We evaluate our approach on English and French speech signals and show that it outperforms a baseline method that does not use emotional information, including when the emotion embedding is extracted from a different language. Although this preliminary study does not directly address the machine translation issue, our results demonstrate the effectiveness of our approach for cross-lingual emotion preservation in the context of speech resynthesis.
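To make the described pipeline concrete, the sketch below shows one plausible way to condition a prosody predictor on a language-independent emotion embedding, mapping discrete speech units to per-unit pitch and duration targets. This is a minimal illustration only: the module names, dimensions, and architecture (`ProsodyPredictor`, a GRU encoder, linear regression heads) are assumptions for exposition and do not reproduce the authors' actual model.

```python
# Hypothetical sketch of an emotion-conditioned prosody predictor.
# All names, dimensions, and layer choices are illustrative assumptions,
# not the implementation described in the paper.
import torch
import torch.nn as nn


class ProsodyPredictor(nn.Module):
    def __init__(self, num_units=1000, unit_dim=256, emotion_dim=192, hidden=256):
        super().__init__()
        # Embed discrete speech units (e.g. cluster indices from a self-supervised model).
        self.unit_embedding = nn.Embedding(num_units, unit_dim)
        # Project the utterance-level, language-independent emotion embedding
        # into the same space as the unit embeddings.
        self.emotion_proj = nn.Linear(emotion_dim, unit_dim)
        self.encoder = nn.GRU(unit_dim, hidden, batch_first=True, bidirectional=True)
        # Per-unit regression heads for pitch (e.g. log-F0) and duration (e.g. log frames).
        self.pitch_head = nn.Linear(2 * hidden, 1)
        self.duration_head = nn.Linear(2 * hidden, 1)

    def forward(self, units, emotion_emb):
        # units: (batch, seq_len) integer unit indices
        # emotion_emb: (batch, emotion_dim) utterance-level emotion vector
        x = self.unit_embedding(units)
        # Broadcast the emotion conditioning over the time axis.
        x = x + self.emotion_proj(emotion_emb).unsqueeze(1)
        h, _ = self.encoder(x)
        pitch = self.pitch_head(h).squeeze(-1)       # (batch, seq_len)
        duration = self.duration_head(h).squeeze(-1)  # (batch, seq_len)
        return pitch, duration


if __name__ == "__main__":
    # Usage sketch with random inputs standing in for real unit sequences
    # and emotion embeddings.
    model = ProsodyPredictor()
    units = torch.randint(0, 1000, (2, 50))
    emotion_emb = torch.randn(2, 192)
    pitch, duration = model(units, emotion_emb)
    print(pitch.shape, duration.shape)  # torch.Size([2, 50]) torch.Size([2, 50])
```

In this reading, the predicted pitch and duration would drive a unit-to-waveform synthesizer so that the resynthesized target speech carries the emotional content of the source, even when the emotion embedding comes from a different language.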


doi: 10.21437/SSW.2023-29

Cite as: Duret, J., Estève, Y., Parcollet, T. (2023) Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data. Proc. 12th ISCA Speech Synthesis Workshop (SSW2023), 184-190, doi: 10.21437/SSW.2023-29

@inproceedings{duret23_ssw,
  author={Jarod Duret and Yannick Estève and Titouan Parcollet},
  title={{Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data}},
  year=2023,
  booktitle={Proc. 12th ISCA Speech Synthesis Workshop (SSW2023)},
  pages={184--190},
  doi={10.21437/SSW.2023-29}
}