15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

TTS Synthesis with Bidirectional LSTM Based Recurrent Neural Networks

Yuchen Fan (1), Yao Qian (2), Feng-Long Xie (2), Frank K. Soong (2)

(1) Shanghai Jiao Tong University, China
(2) Microsoft, China

Feed-forward deep neural network (DNN)-based text-to-speech (TTS) systems have recently been shown to outperform decision-tree-clustered, context-dependent HMM TTS systems. However, long-time-span contextual effects in a speech utterance are still not easy to accommodate because of the intrinsically feed-forward nature of DNN-based modeling. Also, to synthesize a smooth speech trajectory, dynamic features are commonly used to constrain speech parameter trajectory generation in HMM-based TTS [2]. In this paper, recurrent neural networks (RNNs) with bidirectional long short-term memory (BLSTM) cells are adopted to capture the correlation or co-occurrence information between any two instants in a speech utterance for parametric TTS synthesis. Experimental results show that a hybrid system of DNN and BLSTM-RNN, i.e., lower hidden layers with a feed-forward structure cascaded with upper hidden layers with a bidirectional LSTM-RNN structure, can outperform both the conventional decision-tree-based HMM system and a DNN TTS system, objectively and subjectively. The speech trajectory generated by the BLSTM-RNN TTS is fairly smooth, and no dynamic constraints are needed.
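The hybrid structure described in the abstract (feed-forward lower layers cascaded with a bidirectional LSTM upper layer) can be illustrated with a minimal NumPy forward pass. This is only a sketch under stated assumptions, not the authors' implementation: the function names, layer sizes, tanh activations, single BLSTM layer, and random untrained weights are all hypothetical choices made for illustration, and no training or acoustic feature extraction is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(rng, n_in, n_hid):
    # One stacked weight matrix for the four gates (input, forget, candidate, output).
    return {"W": rng.standard_normal((n_in + n_hid, 4 * n_hid)) * 0.1,
            "b": np.zeros(4 * n_hid)}

def lstm_pass(params, x, reverse=False):
    """Run one LSTM direction over x of shape (T, n_in); returns (T, n_hid)."""
    T = x.shape[0]
    n_hid = params["b"].shape[0] // 4
    h, c = np.zeros(n_hid), np.zeros(n_hid)
    out = np.zeros((T, n_hid))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        z = np.concatenate([x[t], h]) @ params["W"] + params["b"]
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # cell state update
        h = sigmoid(o) * np.tanh(c)                   # hidden state for frame t
        out[t] = h
    return out

def init_model(seed, n_in, n_ff, n_lstm, n_out, n_ff_layers=2):
    # Hypothetical sizes; weights are random, i.e. the model is untrained.
    rng = np.random.default_rng(seed)
    ff, d = [], n_in
    for _ in range(n_ff_layers):
        ff.append((rng.standard_normal((d, n_ff)) * 0.1, np.zeros(n_ff)))
        d = n_ff
    return {"ff": ff,
            "fwd": init_lstm(rng, d, n_lstm),
            "bwd": init_lstm(rng, d, n_lstm),
            "out": (rng.standard_normal((2 * n_lstm, n_out)) * 0.1, np.zeros(n_out))}

def hybrid_forward(model, x):
    """Map per-frame input features (T, n_in) to per-frame outputs (T, n_out)."""
    h = x
    for W, b in model["ff"]:          # lower layers: feed-forward, frame by frame
        h = np.tanh(h @ W + b)
    fwd = lstm_pass(model["fwd"], h)                # left-to-right context
    bwd = lstm_pass(model["bwd"], h, reverse=True)  # right-to-left context
    h = np.concatenate([fwd, bwd], axis=1)          # upper layer: bidirectional LSTM
    W_out, b_out = model["out"]
    return h @ W_out + b_out          # linear output layer

model = init_model(seed=0, n_in=6, n_ff=8, n_lstm=5, n_out=3)
x = np.random.default_rng(1).standard_normal((4, 6))  # 4 frames of toy input features
y = hybrid_forward(model, x)
print(y.shape)
```

Because the backward LSTM reads the utterance from the last frame to the first, the output at every frame depends on the whole input sequence, which is the property the abstract contrasts with purely feed-forward DNN modeling.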


Bibliographic reference.  Fan, Yuchen / Qian, Yao / Xie, Feng-Long / Soong, Frank K. (2014): "TTS synthesis with bidirectional LSTM based recurrent neural networks", In INTERSPEECH-2014, 1964-1968.