On Online Attention-Based Speech Recognition and Joint Mandarin Character-Pinyin Training

William Chan, Ian Lane


In this paper, we explore the use of attention-based models for online speech recognition without a language model or search. Our model is an attention-based neural network that directly emits English/Mandarin characters as outputs, jointly learning the pronunciation, acoustic, and language models. We evaluate the model for online speech recognition on English and Mandarin. On English, we achieve a 33.0% WER on the WSJ task, a 5.4% absolute reduction in WER compared to an online CTC-based system. We also introduce a new training method and show how to learn joint Mandarin Character-Pinyin models. Our character-only Mandarin model achieves a 72% CER on the GALE Phase 2 evaluation; with our joint Mandarin Character-Pinyin model, we achieve 59.3% CER, a 12.7% absolute improvement over the character-only model.
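To make the abstract's two ideas concrete, here is a minimal numpy sketch of one step of content-based attention (the decoder attends over encoder frames to emit a character) and a toy multi-task objective for joint Character-Pinyin training. The dot-product scoring, the function names, and the `lam` mixing weight are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Content-based attention (illustrative dot-product scoring):
    score each encoder frame against the current decoder state,
    normalize into an alignment, and return the weighted context."""
    scores = encoder_states @ decoder_state   # (T,)
    alignment = softmax(scores)               # (T,), sums to 1
    context = alignment @ encoder_states      # (D,)
    return context, alignment

def joint_loss(char_nll, pinyin_nll, lam=0.5):
    # Toy multi-task objective: character loss plus a weighted
    # Pinyin auxiliary loss; `lam` is a hypothetical mixing weight.
    return char_nll + lam * pinyin_nll

# Toy example: 5 encoder frames of dimension 4.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))   # encoder hidden states
s = rng.normal(size=4)        # decoder state at one output step
context, alignment = attention_context(s, H)
```

In the paper's setting the context vector would condition the next character prediction, so the network learns acoustics, pronunciation, and language jointly rather than through separate components.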


DOI: 10.21437/Interspeech.2016-334

Cite as

Chan, W., Lane, I. (2016) On Online Attention-Based Speech Recognition and Joint Mandarin Character-Pinyin Training. Proc. Interspeech 2016, 3404-3408.

BibTeX
@inproceedings{Chan+2016,
author={William Chan and Ian Lane},
title={On Online Attention-Based Speech Recognition and Joint Mandarin Character-Pinyin Training},
year={2016},
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-334},
url={http://dx.doi.org/10.21437/Interspeech.2016-334},
pages={3404--3408}
}