This paper investigates cascaded and parallel deep neural networks for speech synthesis. In these systems, suprasegmental linguistic features (syllable-level and above) are processed separately from segmental features (phone-level and below). The suprasegmental component of the networks learns compact distributed representations of high-level linguistic units without any segmental influence. These representations are then integrated into a frame-level system using a cascaded or a parallel approach. In the cascaded network, suprasegmental representations are used as input to the frame-level network. In the parallel network, segmental and suprasegmental features are processed separately and concatenated at a later stage. The experiments are conducted with a standard set of high-dimensional linguistic features as well as a hand-pruned set. Hierarchical systems are consistently preferred over the baseline feedforward systems, and parallel networks are preferred over cascaded networks.
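The difference between the two integration strategies described above can be illustrated with a minimal forward-pass sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the layer sizes, the tanh activations, the random untrained weights, the single hidden layer per branch, and the names `dense`, `cascaded_forward`, and `parallel_forward`. In particular, the cascaded variant here assumes the suprasegmental representation is concatenated with the segmental features at the input of the frame-level network.

```python
import math
import random

random.seed(0)

# Hypothetical sizes, chosen only to keep the example small.
SEG_DIM, SUPRA_DIM, EMB_DIM, HID_DIM, OUT_DIM = 6, 10, 3, 8, 4


def dense(in_dim, out_dim):
    """Random weights (in_dim x out_dim) and zero biases for one layer."""
    weights = [[random.gauss(0.0, 0.1) for _ in range(out_dim)]
               for _ in range(in_dim)]
    return weights, [0.0] * out_dim


def affine(x, layer):
    """Compute x @ W + b for a single input vector x."""
    weights, bias = layer
    return [sum(xi * row[j] for xi, row in zip(x, weights)) + bias[j]
            for j in range(len(bias))]


def hidden(x, layer):
    """Affine transform followed by a tanh non-linearity."""
    return [math.tanh(v) for v in affine(x, layer)]


# Cascaded (assumed form): a suprasegmental encoder produces a compact
# representation, which is fed to the frame-level network as input
# alongside the segmental features.
supra_encoder = dense(SUPRA_DIM, EMB_DIM)
casc_hidden = dense(SEG_DIM + EMB_DIM, HID_DIM)
casc_out = dense(HID_DIM, OUT_DIM)


def cascaded_forward(seg, supra):
    emb = hidden(supra, supra_encoder)      # no segmental influence here
    h = hidden(seg + emb, casc_hidden)      # list concatenation at the input
    return affine(h, casc_out)


# Parallel (assumed form): each feature stream has its own branch, and the
# branch outputs are concatenated at a later stage, before a joint layer.
par_seg = dense(SEG_DIM, HID_DIM)
par_supra = dense(SUPRA_DIM, EMB_DIM)
par_joint = dense(HID_DIM + EMB_DIM, HID_DIM)
par_out = dense(HID_DIM, OUT_DIM)


def parallel_forward(seg, supra):
    h_seg = hidden(seg, par_seg)
    h_supra = hidden(supra, par_supra)
    h = hidden(h_seg + h_supra, par_joint)  # late concatenation
    return affine(h, par_out)


# One frame of made-up input features.
seg = [random.gauss(0.0, 1.0) for _ in range(SEG_DIM)]
supra = [random.gauss(0.0, 1.0) for _ in range(SUPRA_DIM)]
cascaded_prediction = cascaded_forward(seg, supra)
parallel_prediction = parallel_forward(seg, supra)
```

Both functions map the same pair of feature vectors to an output vector of the same size. The only structural difference in this sketch is where the suprasegmental information joins the segmental stream: at the input for the cascaded variant, after a separate hidden layer for the parallel one.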
Cite as: Sam Ribeiro, M., Watts, O., Yamagishi, J. (2016) Parallel and cascaded deep neural networks for text-to-speech synthesis. Proc. 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9), 100-105, doi: 10.21437/SSW.2016-17
@inproceedings{samribeiro16_ssw,
  author={Manuel {Sam Ribeiro} and Oliver Watts and Junichi Yamagishi},
  title={{Parallel and cascaded deep neural networks for text-to-speech synthesis}},
  year=2016,
  booktitle={Proc. 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9)},
  pages={100--105},
  doi={10.21437/SSW.2016-17}
}