15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Deep Neural Network Based Trainable Voice Source Model for Synthesis of Speech with Varying Vocal Effort

Tuomo Raitio (1), Antti Suni (2), Lauri Juvela (1), Martti Vainio (2), Paavo Alku (1)

(1) Aalto University, Finland
(2) University of Helsinki, Finland

This paper studies a deep neural network (DNN) based voice source modelling method for the synthesis of speech with varying vocal effort. The new trainable voice source model uses a DNN to learn a mapping from acoustic features to the time-domain, pitch-synchronous glottal flow waveform. The model is trained on speech material covering breathy, normal, and Lombard speech. In synthesis, a normal voice is first adapted to the desired style, and the flexible DNN-based voice source model then automatically generates a style-specific excitation waveform from the adapted acoustic features. The proposed model is compared to a robust, high-quality baseline excitation modelling method that uses manually selected mean glottal flow pulses for each vocal effort level, together with a spectral matching filter that matches the voice source spectrum to the desired style. Subjective evaluations show that the proposed DNN-based method is rated as comparable to the baseline method, while avoiding manual pulse selection and being computationally faster than a system that uses a spectral matching filter.
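The core idea of the abstract, a network that maps a frame of acoustic features to a fixed-length glottal flow pulse, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the feature dimension, the hidden layer sizes, the pulse length, the tanh activations, the plain gradient-descent training, and the synthetic Hann-shaped targets that stand in for real glottal flow pulses.

```python
import numpy as np

# All sizes below are illustrative assumptions, not values from the paper.
N_FEATURES = 47    # hypothetical acoustic feature dimension per frame
N_HIDDEN = 64      # hypothetical hidden layer width
PULSE_LEN = 400    # hypothetical fixed length of a resampled glottal pulse


def init_params(rng, sizes):
    """Create (weights, bias) pairs for a fully connected network."""
    return [
        (rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(params, features):
    """Return all layer activations; the last one is the predicted pulse batch."""
    activations = [features]
    for weights, bias in params[:-1]:
        activations.append(np.tanh(activations[-1] @ weights + bias))
    weights, bias = params[-1]
    activations.append(activations[-1] @ weights + bias)  # linear output layer
    return activations


def train_step(params, features, target_pulses, lr=0.1):
    """One gradient-descent step on mean squared error.

    Updates params in place and returns the loss measured before the update.
    """
    acts = forward(params, features)
    error = acts[-1] - target_pulses
    loss = float(np.mean(error ** 2))
    grad = 2.0 * error / error.size  # d(loss) / d(network output)
    for i in reversed(range(len(params))):
        weights, bias = params[i]
        grad_w = acts[i].T @ grad
        grad_b = grad.sum(axis=0)
        if i > 0:
            # Backpropagate through the tanh that produced acts[i]
            grad = (grad @ weights.T) * (1.0 - acts[i] ** 2)
        params[i] = (weights - lr * grad_w, bias - lr * grad_b)
    return loss


rng = np.random.default_rng(0)
params = init_params(rng, [N_FEATURES, N_HIDDEN, N_HIDDEN, PULSE_LEN])

# Placeholder data only: random "acoustic features" and Hann-shaped "pulses"
# with a random per-frame gain. Real training would use features and
# glottal flow pulses extracted from breathy, normal, and Lombard speech.
features = rng.normal(size=(32, N_FEATURES))
targets = np.hanning(PULSE_LEN)[None, :] * rng.uniform(0.5, 1.5, size=(32, 1))

losses = [train_step(params, features, targets) for _ in range(200)]

# At synthesis time, one frame of (adapted) features yields one pulse.
pulse = forward(params, features[:1])[-1][0]
```

In this sketch a change of speaking style enters only through the input features, which mirrors the abstract's point that the excitation waveform is generated from the adapted acoustic features rather than from a manually selected pulse per style.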


Bibliographic reference. Raitio, Tuomo / Suni, Antti / Juvela, Lauri / Vainio, Martti / Alku, Paavo (2014): "Deep neural network based trainable voice source model for synthesis of speech with varying vocal effort", in INTERSPEECH-2014, 1969-1973.