Improved Vocal Tract Length Perturbation for a State-of-the-Art End-to-End Speech Recognition System

Chanwoo Kim, Minkyu Shin, Abhinav Garg, Dhananjaya Gowda


In this paper, we present an improved vocal tract length perturbation (VTLP) algorithm as a data augmentation technique. VTLP is usually accomplished by adjusting the center frequencies of the mel filterbank, as in [1]. In contrast to this conventional approach, we re-synthesize waveforms from the frequency-warped spectra using overlap-add (OLA). This approach has two advantages. First, we can apply an “acoustic simulator” [2, 3] after performing the VTLP-based frequency warping. Second, we may use a window length for frequency warping that differs from the one used in feature processing. We observe that the best performance is obtained when the warping coefficients are distributed between 0.8 and 1.2 and the window length is 50 ms. We obtain 3.66% WER and 12.39% WER on the Librispeech test-clean and test-other sets, respectively, using an attention-based end-to-end speech recognition system without any language model (LM). Using shallow fusion with a Transformer LM, we achieve 2.44% WER and 8.29% WER on the same two sets. To the best of our knowledge, the 2.44% WER on test-clean is the best result reported on this test set.
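
As an illustration of the waveform-domain idea described above, the following is a minimal sketch and not the authors' implementation. It assumes a piecewise-linear warping function with a hypothetical boundary frequency `f_hi`, a periodic Hann window at 50% overlap, linear interpolation of the real and imaginary parts of each short-time spectrum, and a uniform draw of the warping coefficient; none of these choices is specified in the abstract. Only the 0.8 to 1.2 range and the 50 ms window come from the text. The names `warp_frequencies` and `vtlp_ola` are made up for this example.

```python
import numpy as np


def warp_frequencies(freqs, alpha, f_hi, nyquist):
    """Piecewise-linear warp (an assumed form): scale by alpha below a
    boundary, then map linearly so that Nyquist stays at Nyquist."""
    scaled_edge = f_hi * min(alpha, 1.0)
    boundary = scaled_edge / alpha
    low = freqs * alpha
    high = nyquist - (nyquist - scaled_edge) / (nyquist - boundary) * (nyquist - freqs)
    return np.where(freqs <= boundary, low, high)


def vtlp_ola(wave, sample_rate=16000, alpha=1.0, window_ms=50.0, f_hi=4800.0):
    """Warp each short-time spectrum and re-synthesize with overlap-add."""
    wave = np.asarray(wave, dtype=np.float64)
    win_len = int(round(sample_rate * window_ms / 1000.0))
    win_len += win_len % 2                     # even length so the hop is exactly half
    hop = win_len // 2
    window = np.hanning(win_len + 1)[:-1]      # periodic Hann: overlapped copies sum to 1

    # Pad so that every input sample is covered by exactly two frames.
    n_frames = int(np.ceil(len(wave) / hop)) + 1
    padded = np.zeros((n_frames + 1) * hop)
    padded[hop:hop + len(wave)] = wave

    freqs = np.fft.rfftfreq(win_len, d=1.0 / sample_rate)
    warped = warp_frequencies(freqs, alpha, f_hi, sample_rate / 2.0)

    out = np.zeros_like(padded)
    for i in range(n_frames):
        start = i * hop
        spec = np.fft.rfft(padded[start:start + win_len] * window)
        # The value at warped frequency warp(f) is taken from original frequency f.
        spec_w = (np.interp(freqs, warped, spec.real)
                  + 1j * np.interp(freqs, warped, spec.imag))
        out[start:start + win_len] += np.fft.irfft(spec_w, n=win_len)
    return out[hop:hop + len(wave)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(16000)        # stand-in for one second of audio
    alpha = rng.uniform(0.8, 1.2)              # uniform sampling is an assumption
    augmented = vtlp_ola(speech, alpha=alpha, window_ms=50.0)
    print(augmented.shape)
```

Because the output is a waveform rather than a warped filterbank, it could in principle be passed to a further waveform-level stage and then to a feature extractor with its own, different window length, which is the property the abstract highlights.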


DOI: 10.21437/Interspeech.2019-3227

Cite as: Kim, C., Shin, M., Garg, A., Gowda, D. (2019) Improved Vocal Tract Length Perturbation for a State-of-the-Art End-to-End Speech Recognition System. Proc. Interspeech 2019, 739-743, DOI: 10.21437/Interspeech.2019-3227.


@inproceedings{Kim2019,
  author={Chanwoo Kim and Minkyu Shin and Abhinav Garg and Dhananjaya Gowda},
  title={{Improved Vocal Tract Length Perturbation for a State-of-the-Art End-to-End Speech Recognition System}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={739--743},
  doi={10.21437/Interspeech.2019-3227},
  url={http://dx.doi.org/10.21437/Interspeech.2019-3227}
}