Using Text and Acoustic Features in Predicting Glottal Excitation Waveforms for Parametric Speech Synthesis with Recurrent Neural Networks

Lauri Juvela, Xin Wang, Shinji Takaki, Manu Airaksinen, Junichi Yamagishi, Paavo Alku


This work studies the use of deep learning methods to directly model glottal excitation waveforms from context-dependent text features in a text-to-speech synthesis system. Glottal vocoding is integrated into a deep neural network-based text-to-speech framework where text and acoustic features can be flexibly used as either network inputs or outputs. Long short-term memory recurrent neural networks are utilised in two stages: first, in mapping text features to acoustic features, and second, in predicting glottal waveforms from the text and/or acoustic features. Results show that predicting the excitation directly from text features yields quality similar to predicting it from acoustic features, and both approaches outperform a baseline system that uses a fixed glottal pulse for excitation generation.
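The abstract describes a two-stage pipeline: one LSTM network maps text features to acoustic features, and a second LSTM predicts glottal excitation waveforms from text and/or acoustic features. The following is a minimal sketch of such a two-stage setup, not the authors' implementation; the framework (PyTorch), feature dimensions, layer sizes, and per-frame pulse-length output are all illustrative assumptions.

import torch
import torch.nn as nn

class TextToAcoustic(nn.Module):
    """Stage 1 (sketch): map context-dependent text features to acoustic features."""
    def __init__(self, text_dim=300, acoustic_dim=60, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(text_dim, hidden_dim, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden_dim, acoustic_dim)

    def forward(self, text_feats):            # (batch, frames, text_dim)
        h, _ = self.lstm(text_feats)
        return self.proj(h)                   # (batch, frames, acoustic_dim)

class ExcitationPredictor(nn.Module):
    """Stage 2 (sketch): predict a fixed-length glottal excitation segment per frame
    from text and/or acoustic features (concatenated when both are used)."""
    def __init__(self, in_dim=360, pulse_len=400, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden_dim, pulse_len)

    def forward(self, feats):                 # (batch, frames, in_dim)
        h, _ = self.lstm(feats)
        return self.proj(h)                   # (batch, frames, pulse_len)

# Example usage: predict excitation from text features plus stage-1 acoustic features.
text_feats = torch.randn(1, 100, 300)         # 100 frames of hypothetical text features
stage1 = TextToAcoustic()
stage2 = ExcitationPredictor(in_dim=300 + 60)
acoustic = stage1(text_feats)
excitation = stage2(torch.cat([text_feats, acoustic], dim=-1))

Using text features alone would simply mean feeding text_feats to a stage-2 network with in_dim=300; the paper compares such input choices against a fixed-glottal-pulse baseline.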


DOI: 10.21437/Interspeech.2016-712

Cite as

Juvela, L., Wang, X., Takaki, S., Airaksinen, M., Yamagishi, J., Alku, P. (2016) Using Text and Acoustic Features in Predicting Glottal Excitation Waveforms for Parametric Speech Synthesis with Recurrent Neural Networks. Proc. Interspeech 2016, 2283-2287.

Bibtex
@inproceedings{Juvela+2016,
  author={Lauri Juvela and Xin Wang and Shinji Takaki and Manu Airaksinen and Junichi Yamagishi and Paavo Alku},
  title={Using Text and Acoustic Features in Predicting Glottal Excitation Waveforms for Parametric Speech Synthesis with Recurrent Neural Networks},
  year={2016},
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-712},
  url={http://dx.doi.org/10.21437/Interspeech.2016-712},
  pages={2283--2287}
}