Current approaches to statistical parametric speech synthesis using
neural networks generally require input at the same temporal resolution
as the output, typically a frame every 5 ms or, in some cases, at the
waveform sampling rate. It is therefore necessary to fabricate highly
redundant frame-level (or sample-level) linguistic features at the input. This
paper proposes the use of a hierarchical encoder-decoder model to perform
the sequence-to-sequence regression in a way that takes the input linguistic
features at their original timescales, and preserves the relationships
between words, syllables and phones. The proposed model is designed
to make more effective use of suprasegmental features than conventional
architectures, while also being computationally efficient. Experiments
were conducted on prosodically varied audiobook material because the
use of suprasegmental features is thought to be particularly important
for such material. Both objective measures and results from subjective listening
tests, which asked listeners to focus on prosody, show that the proposed
method performs significantly better than a conventional architecture
that requires the linguistic input to be at the acoustic frame rate.
We provide code and a recipe to enable our system to be reproduced
using the Merlin toolkit.
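
To make the architecture concrete, the sketch below shows one way the hierarchical idea could be realised. It is a minimal illustration in PyTorch, not the paper's exact model or the Merlin recipe: each linguistic level (word, syllable, phone) is encoded at its own timescale, parent encodings are broadcast down to the child timescale via parent-index maps, and only the final decoder runs at the acoustic frame rate. All class and argument names, feature dimensions, and the scalar position feature are illustrative assumptions.

# Minimal sketch (illustrative, not the paper's exact model): a
# hierarchical encoder-decoder that consumes linguistic features at
# word, syllable and phone timescales and decodes frame-level
# acoustic features. All dimensions are assumed for the example.
import torch
import torch.nn as nn

class HierarchicalEncoderDecoder(nn.Module):
    def __init__(self, word_dim=50, syl_dim=20, phone_dim=60,
                 hidden=64, acoustic_dim=187):
        super().__init__()
        self.word_enc = nn.GRU(word_dim, hidden,
                               batch_first=True, bidirectional=True)
        self.syl_enc = nn.GRU(syl_dim + 2 * hidden, hidden,
                              batch_first=True, bidirectional=True)
        self.phone_enc = nn.GRU(phone_dim + 2 * hidden, hidden,
                                batch_first=True, bidirectional=True)
        # Frame-level decoder input: upsampled phone encodings plus a
        # scalar position feature (a simplification for illustration).
        self.frame_dec = nn.GRU(2 * hidden + 1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, acoustic_dim)

    def forward(self, words, sylls, phones,
                syl_to_word, phone_to_syl, frame_to_phone):
        # Each *_to_* index tensor maps a fine-grained unit to its
        # parent, so parent encodings can be gathered (upsampled) to
        # the child timescale while preserving the hierarchy.
        w, _ = self.word_enc(words)                       # (B, W, 2H)
        s_in = torch.cat([sylls, w[:, syl_to_word]], -1)  # (B, S, ...)
        s, _ = self.syl_enc(s_in)
        p_in = torch.cat([phones, s[:, phone_to_syl]], -1)
        p, _ = self.phone_enc(p_in)
        p_up = p[:, frame_to_phone]                       # (B, T, 2H)
        T = frame_to_phone.numel()
        pos = torch.linspace(0, 1, T).view(1, T, 1)
        pos = pos.expand(p_up.size(0), -1, -1)
        h, _ = self.frame_dec(torch.cat([p_up, pos], -1))
        return self.out(h)                                # (B, T, acoustic)

# Toy usage: 3 words, 5 syllables, 8 phones, 5 frames per phone.
model = HierarchicalEncoderDecoder()
words = torch.randn(1, 3, 50)
sylls = torch.randn(1, 5, 20)
phones = torch.randn(1, 8, 60)
syl_to_word = torch.tensor([0, 0, 1, 2, 2])
phone_to_syl = torch.tensor([0, 0, 1, 2, 3, 3, 4, 4])
frame_to_phone = torch.repeat_interleave(torch.arange(8), 5)
y = model(words, sylls, phones, syl_to_word, phone_to_syl, frame_to_phone)
print(y.shape)  # torch.Size([1, 40, 187])

Note the key design point this sketch captures: only the final decoder operates at frame rate, so the coarser levels are processed once per word, syllable or phone, avoiding the redundant frame-by-frame copying of linguistic features that conventional architectures require.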
Cite as: Ronanki, S., Watts, O., King, S. (2017) A Hierarchical Encoder-Decoder Model for Statistical Parametric Speech Synthesis. Proc. Interspeech 2017, 1133-1137, doi: 10.21437/Interspeech.2017-628
@inproceedings{ronanki17_interspeech,
  author={Srikanth Ronanki and Oliver Watts and Simon King},
  title={{A Hierarchical Encoder-Decoder Model for Statistical Parametric Speech Synthesis}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1133--1137},
  doi={10.21437/Interspeech.2017-628}
}