ISCA Archive SSW 2016

Parallel and cascaded deep neural networks for text-to-speech synthesis

Manuel Sam Ribeiro, Oliver Watts, Junichi Yamagishi

An investigation of cascaded and parallel deep neural networks for speech synthesis is conducted. In these systems, suprasegmental linguistic features (syllable-level and above) are processed separately from segmental features (phone-level and below). The suprasegmental component of the networks learns compact distributed representations of high-level linguistic units without any segmental influence. These representations are then integrated into a frame-level system using a cascaded or a parallel approach. In the cascaded network, suprasegmental representations are used as input to the frame-level network. In the parallel network, segmental and suprasegmental features are processed separately and concatenated at a later stage. Experiments are conducted with a standard set of high-dimensional linguistic features as well as a hand-pruned one. It is observed that hierarchical systems are consistently preferred over the baseline feedforward systems. Similarly, parallel networks are preferred over cascaded networks.
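
To make the two arrangements concrete, the following is a minimal illustrative sketch in PyTorch, not the authors' implementation: all layer sizes, activations, and feature dimensions are assumptions, and the suprasegmental inputs are assumed to have been upsampled to frame level before being combined with the segmental stream.

# Illustrative sketch of the cascaded and parallel hierarchical arrangements
# described in the abstract. Dimensions and activations are assumed values,
# not the configuration reported in the paper.
import torch
import torch.nn as nn


class SuprasegmentalEncoder(nn.Module):
    """Learns a compact distributed representation of syllable-level-and-above
    features, with no segmental influence."""
    def __init__(self, supra_dim, bottleneck_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(supra_dim, 256), nn.Tanh(),
            nn.Linear(256, bottleneck_dim), nn.Tanh(),
        )

    def forward(self, supra_feats):
        return self.net(supra_feats)


class CascadedTTS(nn.Module):
    """Cascaded variant: the suprasegmental representation is used as (part of
    the) input to the frame-level network."""
    def __init__(self, seg_dim, supra_dim, out_dim, bottleneck_dim=32):
        super().__init__()
        self.supra = SuprasegmentalEncoder(supra_dim, bottleneck_dim)
        self.frame_net = nn.Sequential(
            nn.Linear(seg_dim + bottleneck_dim, 512), nn.Tanh(),
            nn.Linear(512, 512), nn.Tanh(),
            nn.Linear(512, out_dim),
        )

    def forward(self, seg_feats, supra_feats):
        supra_repr = self.supra(supra_feats)  # assumed frame-aligned
        return self.frame_net(torch.cat([seg_feats, supra_repr], dim=-1))


class ParallelTTS(nn.Module):
    """Parallel variant: segmental and suprasegmental streams are processed
    separately and concatenated at a later hidden layer."""
    def __init__(self, seg_dim, supra_dim, out_dim, bottleneck_dim=32):
        super().__init__()
        self.supra = SuprasegmentalEncoder(supra_dim, bottleneck_dim)
        self.seg_net = nn.Sequential(
            nn.Linear(seg_dim, 512), nn.Tanh(),
            nn.Linear(512, 256), nn.Tanh(),
        )
        self.joint = nn.Sequential(
            nn.Linear(256 + bottleneck_dim, 512), nn.Tanh(),
            nn.Linear(512, out_dim),
        )

    def forward(self, seg_feats, supra_feats):
        merged = torch.cat([self.seg_net(seg_feats), self.supra(supra_feats)], dim=-1)
        return self.joint(merged)

In both sketches the acoustic output (out_dim) stands in for the frame-level vocoder parameters; the only structural difference is where the suprasegmental bottleneck joins the segmental stream.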


doi: 10.21437/SSW.2016-17

Cite as: Sam Ribeiro, M., Watts, O., Yamagishi, J. (2016) Parallel and cascaded deep neural networks for text-to-speech synthesis. Proc. 9th ISCA Speech Synthesis Workshop (SSW 9), 100-105, doi: 10.21437/SSW.2016-17

@inproceedings{samribeiro16_ssw,
  author={Manuel {Sam Ribeiro} and Oliver Watts and Junichi Yamagishi},
  title={{Parallel and cascaded deep neural networks for text-to-speech synthesis}},
  year=2016,
  booktitle={Proc. 9th ISCA Speech Synthesis Workshop (SSW 9)},
  pages={100--105},
  doi={10.21437/SSW.2016-17}
}