Implementation of neural network based synthesizers for Spanish and Basque

Victor Garcia, Inma Hernaez, Eva Navas

This paper describes the implementation of neural network-based Text-to-Speech (TTS) synthesizers for Spanish and Basque. The systems are trained on the voices of one male and one female speaker, both bilingual, using a data set of around four and a half hours for each voice and language. The system uses Tacotron to compute mel-spectrograms from the input text sequence and WaveGlow to generate the output audio from them. Training these models with a limited amount of data leads to synthesis errors in some utterances, which affect the naturalness of the audio and can even produce unintelligible speech. In this paper, we describe the method used to automatically detect erroneously synthesized audio and the strategy followed to address the causes of the errors. The method has been validated by testing the TTS systems on a large set of out-of-domain sentences. The result is a fully operational system capable of generating good-quality, natural-sounding audio, as shown by the evaluation conducted.

doi: 10.21437/IberSPEECH.2021-48

Garcia, V, Hernaez, I, Navas, E (2021) Implementation of neural network based synthesizers for Spanish and Basque. Proc. IberSPEECH 2021, 225-229, doi: 10.21437/IberSPEECH.2021-48.