The depth of a neural network is a vital factor that affects its performance. Recently, a new architecture called the highway network was proposed. This network facilitates the training of very deep neural networks by using gate units to control an information highway over the conventional hidden layer. For the speech synthesis task, we investigate the performance of highway networks with up to 40 hidden layers. The results suggest that a highway network with 14 non-linear transformation layers is the best choice on our speech corpus and that this highway network achieves better performance than a feed-forward network with 14 hidden layers. On the basis of these results, we further investigate a multi-stream highway network in which separate highway networks are used to predict different kinds of acoustic features, such as the spectral and F0 features. The experimental results suggest that the multi-stream highway network can achieve better objective results than a single network that predicts all the acoustic features. Analysis of the output of the highway gate units also supports the assumption behind the multi-stream network that different hidden representations may be necessary to predict spectral and F0 features.
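As an illustration of the gating mechanism mentioned above, the following is a minimal sketch of a highway layer in its commonly cited form, y = T(x) * H(x) + (1 - T(x)) * x, where T is a sigmoid gate and H is a non-linear transformation. It is not the authors' implementation: the tanh activation, the random weight initialization, and all function and parameter names (`highway_layer`, `highway_stack`, `gate_bias`) are assumptions made for this example only.

```python
import math
import random


def sigmoid(z):
    """Logistic function used for the gate units."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, x):
    """Compute W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer: y = T(x) * H(x) + (1 - T(x)) * x.

    H uses tanh here, which is an assumption for illustration.
    """
    h = [math.tanh(v) for v in affine(w_h, b_h, x)]  # transformation H(x)
    t = [sigmoid(v) for v in affine(w_t, b_t, x)]    # gate T(x) in (0, 1)
    # The gate mixes the transformed signal with the untouched input.
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


def random_square(dim, rng):
    """Hypothetical small random initialization of a dim x dim matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]


def highway_stack(x, depth, gate_bias, seed=0):
    """Apply `depth` highway layers with randomly initialized weights."""
    rng = random.Random(seed)
    for _ in range(depth):
        dim = len(x)
        x = highway_layer(x,
                          random_square(dim, rng), [0.0] * dim,
                          random_square(dim, rng), [gate_bias] * dim)
    return x
```

With a strongly negative `gate_bias`, each gate stays close to zero and the input is carried through the stack almost unchanged, which is the property usually credited with making very deep stacks easier to train. With a strongly positive bias, each layer behaves like an ordinary non-linear hidden layer.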
Cite as: Wang, X., Takaki, S., Yamagishi, J. (2016) Investigating Very Deep Highway Networks for Parametric Speech Synthesis. Proc. 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9), 166-171, doi: 10.21437/SSW.2016-27
@inproceedings{wang16_ssw,
  author={Xin Wang and Shinji Takaki and Junichi Yamagishi},
  title={{Investigating Very Deep Highway Networks for Parametric Speech Synthesis}},
  year=2016,
  booktitle={Proc. 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9)},
  pages={166--171},
  doi={10.21437/SSW.2016-27}
}