Improving Transformer-Based Speech Recognition Systems with Compressed Structure and Speech Attributes Augmentation

Sheng Li, Dabre Raj, Xugang Lu, Peng Shen, Tatsuya Kawahara, Hisashi Kawai


The end-to-end (E2E) model allows for training of automatic speech recognition (ASR) systems without having to consider the acoustic model, lexicon, language model and complicated decoding algorithms, which are integral to conventional ASR systems. Recently, the transformer-based E2E ASR model (ASR-Transformer) showed promising results in many speech recognition tasks. The most common practice is to stack a number of feed-forward layers in the encoder and decoder, and adding new layers improves speech recognition performance significantly. However, this also leads to a large increase in the number of parameters and severe decoding latency. In this paper, we propose to reduce the model complexity by simply reusing parameters across all stacked layers instead of introducing new parameters per layer. To address the slight reduction in recognition quality that this causes, we propose to augment the speech inputs with bags-of-attributes. As a result, we obtain a highly compressed, efficient and high-quality ASR model.
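
To make the effect of reusing parameters across stacked layers concrete, the following is a minimal sketch that counts the parameters of an encoder stack with and without sharing. It is an illustration under stated assumptions, not the authors' implementation: the layer layout (four attention projections, a two-layer feed-forward block, two layer norms), the function names, and the sizes (`d_model=512`, `d_ff=2048`, 6 layers) are hypothetical and are not taken from the paper.

```python
def encoder_layer_params(d_model: int, d_ff: int) -> int:
    """Parameter count of one generic transformer encoder layer (assumed layout)."""
    # Query, key, value and output projections, each with a bias vector.
    attention = 4 * (d_model * d_model + d_model)
    # Position-wise feed-forward block: d_model -> d_ff -> d_model, with biases.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)
    return attention + feed_forward + layer_norms


def stack_params(num_layers: int, d_model: int, d_ff: int, shared: bool) -> int:
    """Parameter count of a stack of identical layers.

    With shared=True a single set of layer parameters is reused at every
    depth, so the count does not grow with num_layers.
    """
    per_layer = encoder_layer_params(d_model, d_ff)
    return per_layer if shared else num_layers * per_layer


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    unshared = stack_params(6, 512, 2048, shared=False)
    shared = stack_params(6, 512, 2048, shared=True)
    print(f"unshared: {unshared}, shared: {shared}, ratio: {unshared // shared}")
```

Under these assumptions, the stack's parameter count shrinks by a factor equal to the number of layers. The sketch counts stored parameters only; it says nothing about how many times the shared layer is applied during decoding.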


DOI: 10.21437/Interspeech.2019-2112

Cite as: Li, S., Raj, D., Lu, X., Shen, P., Kawahara, T., Kawai, H. (2019) Improving Transformer-Based Speech Recognition Systems with Compressed Structure and Speech Attributes Augmentation. Proc. Interspeech 2019, 4400-4404, DOI: 10.21437/Interspeech.2019-2112.


@inproceedings{Li2019,
  author={Sheng Li and Dabre Raj and Xugang Lu and Peng Shen and Tatsuya Kawahara and Hisashi Kawai},
  title={{Improving Transformer-Based Speech Recognition Systems with Compressed Structure and Speech Attributes Augmentation}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4400--4404},
  doi={10.21437/Interspeech.2019-2112},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2112}
}