Syllable-Level Representations of Suprasegmental Features for DNN-Based Text-to-Speech Synthesis

Manuel Sam Ribeiro, Oliver Watts, Junichi Yamagishi


A top-down hierarchical system based on deep neural networks is investigated for the modeling of prosody in speech synthesis. Suprasegmental features are processed separately from segmental features, and a compact distributed representation of high-level units is learned at the syllable level. The suprasegmental representation is then integrated into a frame-level network. Objective measures show that balancing segmental and suprasegmental features can be useful for the frame-level network. Additional features incorporated into the hierarchical system are then tested. At the syllable level, a bag-of-phones representation is proposed, and at the word level, embeddings learned from text sources are used. It is shown that the hierarchical system is able to leverage new features at higher levels more efficiently than a system that exploits them directly at the frame level. A perceptual evaluation of the proposed systems is conducted, followed by a discussion of the results.
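As an illustration of the architecture described in the abstract, the sketch below shows a minimal two-level network in a PyTorch style: a syllable-level encoder compresses suprasegmental features into a compact embedding, which is broadcast to the frames of each syllable and concatenated with segmental features before a frame-level network predicts acoustic parameters. All layer sizes, feature dimensions, and names (HierarchicalProsodyDNN, supra_dim, seg_dim, etc.) are illustrative assumptions, not the configuration reported in the paper.

import torch
import torch.nn as nn

class HierarchicalProsodyDNN(nn.Module):
    """Sketch of a top-down hierarchical model: suprasegmental features are
    encoded at the syllable level, then injected into a frame-level network.
    Dimensions are placeholders, not the paper's settings."""

    def __init__(self, supra_dim=30, seg_dim=300, embed_dim=16, acoustic_dim=187):
        super().__init__()
        # Syllable-level encoder: learns a compact distributed representation
        # of suprasegmental (syllable-level and above) features.
        self.syllable_encoder = nn.Sequential(
            nn.Linear(supra_dim, 64), nn.Tanh(),
            nn.Linear(64, embed_dim), nn.Tanh(),
        )
        # Frame-level network: consumes segmental features together with the
        # syllable embedding broadcast to each frame of that syllable.
        self.frame_net = nn.Sequential(
            nn.Linear(seg_dim + embed_dim, 512), nn.Tanh(),
            nn.Linear(512, 512), nn.Tanh(),
            nn.Linear(512, acoustic_dim),
        )

    def forward(self, supra_feats, seg_feats, frames_per_syllable):
        # supra_feats: [n_syllables, supra_dim]
        # seg_feats:   [n_frames, seg_dim]
        # frames_per_syllable: frame count for each syllable (sums to n_frames)
        syl_embed = self.syllable_encoder(supra_feats)            # [n_syllables, embed_dim]
        # Broadcast each syllable embedding over its frames.
        expanded = torch.repeat_interleave(
            syl_embed, torch.tensor(frames_per_syllable), dim=0)  # [n_frames, embed_dim]
        return self.frame_net(torch.cat([seg_feats, expanded], dim=1))

# Example forward pass under the assumed dimensions:
# model = HierarchicalProsodyDNN()
# supra = torch.randn(3, 30)        # 3 syllables
# seg = torch.randn(50, 300)        # 50 frames
# acoustic = model(supra, seg, [10, 20, 20])

Under these assumptions, the syllable embedding acts as the compact suprasegmental representation that the abstract describes being integrated into the frame-level network.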


DOI: 10.21437/Interspeech.2016-1034

Cite as

Ribeiro, M.S., Watts, O., Yamagishi, J. (2016) Syllable-Level Representations of Suprasegmental Features for DNN-Based Text-to-Speech Synthesis. Proc. Interspeech 2016, 3186-3190.

Bibtex
@inproceedings{Ribeiro+2016,
  author={Manuel Sam Ribeiro and Oliver Watts and Junichi Yamagishi},
  title={Syllable-Level Representations of Suprasegmental Features for DNN-Based Text-to-Speech Synthesis},
  year={2016},
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-1034},
  url={http://dx.doi.org/10.21437/Interspeech.2016-1034},
  pages={3186--3190}
}