Gaussian Prediction Based Attention for Online End-to-End Speech Recognition

Junfeng Hou, Shiliang Zhang, Li-Rong Dai


End-to-end speech recognition has recently attracted much attention. One popular approach is the attention-based encoder-decoder model, which usually generates output sequences iteratively by attending to the whole representation of the input sequence. However, waiting to receive the whole input sequence before predicting outputs is not practical for online or low-latency speech recognition. In this paper, we present a simple but effective attention mechanism that lets the encoder-decoder model generate outputs without attending to the entire input sequence, so that it can be applied to online speech recognition. At each prediction step, the attention is assumed to be a time-moving Gaussian window of variable size, which is predicted from previous input and output information instead of being computed from content over the whole input sequence. To further improve the online performance of the model, we employ deep convolutional neural networks as the encoder. Experiments show that the Gaussian prediction based attention works well and that, with the help of deep convolutional neural networks, the online model achieves a 19.5% phoneme error rate on the TIMIT ASR task.
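
The abstract does not spell out how the window is parameterized, so the following is only a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes that the window mean moves forward by a positive shift at each step, that the shift and the standard deviation come from exponentiated linear projections of a decoder state, and that frames farther than `span` standard deviations from the mean are ignored. The names `gaussian_window_weights`, `predict_window`, `w_shift`, `w_size`, and `span` are all hypothetical.

```python
import numpy as np

def gaussian_window_weights(num_frames, mu, sigma, span=3.0):
    """Normalized Gaussian attention weights centered at frame position mu.

    Frames farther than span * sigma from mu get zero weight, so only a
    local slice of the encoder output is needed (an assumption of this sketch).
    """
    t = np.arange(num_frames, dtype=np.float64)
    w = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    w[np.abs(t - mu) > span * sigma] = 0.0
    total = w.sum()
    if total == 0.0:
        # The window missed every available frame: fall back to the nearest one.
        nearest = int(np.clip(np.rint(mu), 0, num_frames - 1))
        w[nearest] = 1.0
        total = 1.0
    return w / total

def predict_window(prev_mu, dec_state, w_shift, w_size):
    """Hypothetical predictor: exp() keeps the shift and the width positive."""
    shift = np.exp(w_shift @ dec_state)   # how far the window moves forward
    sigma = np.exp(w_size @ dec_state)    # window size (standard deviation)
    return prev_mu + shift, sigma

# Toy usage with random stand-ins for encoder outputs and decoder states.
rng = np.random.default_rng(0)
enc = rng.standard_normal((50, 8))        # 50 frames, 8-dim features
w_shift = 0.1 * rng.standard_normal(4)
w_size = 0.1 * rng.standard_normal(4)

mu = 0.0
for step in range(3):
    dec_state = rng.standard_normal(4)    # placeholder for the decoder state
    mu, sigma = predict_window(mu, dec_state, w_shift, w_size)
    weights = gaussian_window_weights(len(enc), mu, sigma)
    context = weights @ enc               # context vector for this output step
```

Because the weights depend only on the predicted mean and width, each output step in this sketch needs encoder frames only up to roughly `mu + span * sigma`, rather than the full utterance required by content-based attention.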


DOI: 10.21437/Interspeech.2017-751

Cite as: Hou, J., Zhang, S., Dai, L. (2017) Gaussian Prediction Based Attention for Online End-to-End Speech Recognition. Proc. Interspeech 2017, 3692-3696, DOI: 10.21437/Interspeech.2017-751.


@inproceedings{Hou2017,
  author={Junfeng Hou and Shiliang Zhang and Li-Rong Dai},
  title={Gaussian Prediction Based Attention for Online End-to-End Speech Recognition},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3692--3696},
  doi={10.21437/Interspeech.2017-751},
  url={http://dx.doi.org/10.21437/Interspeech.2017-751}
}