16th Annual Conference of the International Speech Communication Association

Dresden, Germany
September 6-10, 2015

Towards End-to-End Speech Recognition for Chinese Mandarin Using Long Short-Term Memory Recurrent Neural Networks

Jie Li, Heng Zhang, Xinyuan Cai, Bo Xu

Chinese Academy of Sciences, China

End-to-end speech recognition systems have been successfully designed for English. Because Chinese Mandarin differs from English in several important respects, additional work is needed to transfer these approaches to Chinese. In this paper, we attempt to build a Chinese speech recognition system using an end-to-end learning method. The system combines a deep Long Short-Term Memory Projected (LSTMP) network architecture with the Connectionist Temporal Classification (CTC) objective function. Chinese characters (about 6,000 in total) are used directly as the output labels. To integrate language model information during decoding, the CTC Beam Search method is adopted and optimized to make it more effective and more efficient. We present first-pass decoding results, obtained by decoding from scratch with the CTC-trained network and a language model. Although these results are not as good as those of a hybrid DNN-HMM system, they indicate that it is feasible to use Chinese characters as the output alphabet in an end-to-end speech recognition system.
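To illustrate the decoding step described above, the following is a minimal sketch of generic CTC prefix beam search with an optional language model term. It is not the optimized variant used in the paper, whose details the abstract does not give. The function name `ctc_prefix_beam_search`, the `lm` callback (assumed to return a log-probability for the next label given the prefix), and the `lm_weight` parameter are all hypothetical. Labels are integer indices, with index 0 assumed to be the CTC blank; in a character-based Mandarin system the remaining indices would map to the roughly 6,000 output characters.

```python
import math
from collections import defaultdict

NEG_INF = -float("inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    if all(x == NEG_INF for x in xs):
        return NEG_INF
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_prefix_beam_search(log_probs, beam_size=4, blank=0, lm=None, lm_weight=0.0):
    """Generic CTC prefix beam search (illustrative sketch).

    log_probs: list of frames, each a list of per-label log-probabilities.
    lm: hypothetical callback lm(prefix, label) -> log-probability, or None.
    Returns the most probable label sequence as a list of ints.
    """
    # Each prefix keeps two scores: log-prob of alignments ending in blank,
    # and log-prob of alignments ending in a non-blank label.
    beam = [((), (0.0, NEG_INF))]
    for frame in log_probs:
        next_beam = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beam:
            last = prefix[-1] if prefix else None
            for s, p in enumerate(frame):
                if s == blank:
                    # A blank leaves the prefix unchanged.
                    n_b, n_nb = next_beam[prefix]
                    next_beam[prefix] = (logsumexp(n_b, p_b + p, p_nb + p), n_nb)
                    continue
                # Extending the prefix with label s; the LM scores new labels only.
                bonus = lm_weight * lm(prefix, s) if lm is not None else 0.0
                new_prefix = prefix + (s,)
                n_b, n_nb = next_beam[new_prefix]
                if s != last:
                    n_nb = logsumexp(n_nb, p_b + p + bonus, p_nb + p + bonus)
                else:
                    # A repeated label extends the prefix only after a blank.
                    n_nb = logsumexp(n_nb, p_b + p + bonus)
                next_beam[new_prefix] = (n_b, n_nb)
                if s == last:
                    # Repeating the last label without a blank collapses into it.
                    n_b, n_nb = next_beam[prefix]
                    next_beam[prefix] = (n_b, logsumexp(n_nb, p_nb + p))
        # Keep only the best prefixes by total probability.
        beam = sorted(next_beam.items(),
                      key=lambda kv: logsumexp(*kv[1]),
                      reverse=True)[:beam_size]
    return list(beam[0][0])

# Two frames over {0: blank, 1: one character}: blank is the per-frame argmax,
# yet the alignments that collapse to [1] sum to 0.64 versus 0.36 for [].
frames = [[math.log(0.6), math.log(0.4)]] * 2
print(ctc_prefix_beam_search(frames))  # [1]
```

In this toy input, picking the most likely label per frame would output the empty sequence, whereas summing over all alignments of each prefix favors the single-character output. That summation is the reason beam search over prefixes is used rather than frame-wise best-path decoding.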


Bibliographic reference. Li, Jie / Zhang, Heng / Cai, Xinyuan / Xu, Bo (2015): "Towards end-to-end speech recognition for Chinese Mandarin using long short-term memory recurrent neural networks", in INTERSPEECH-2015, 3615-3619.