16th Annual Conference of the International Speech Communication Association

Dresden, Germany
September 6-10, 2015

Audio Augmentation for Speech Recognition

Tom Ko (1), Vijayaditya Peddinti (2), Daniel Povey (2), Sanjeev Khudanpur (2)

(1) Huawei Technologies, China
(2) Johns Hopkins University, USA

Data augmentation is a common strategy adopted to increase the quantity of training data, avoid overfitting and improve robustness of the models. In this paper, we investigate audio-level speech augmentation methods which directly process the raw signal. The method we particularly recommend is to change the speed of the audio signal, producing 3 versions of the original signal with speed factors of 0.9, 1.0 and 1.1. The proposed technique has a low implementation cost, making it easy to adopt. We present results on 4 different LVCSR tasks with training data ranging from 100 hours to 960 hours, to examine the effectiveness of audio augmentation in a variety of data scenarios. An average relative improvement of 4.3% was observed across the 4 tasks.
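
The three-way speed change described above can be illustrated with a minimal sketch. It assumes the speed change is done by resampling with linear interpolation, which the abstract does not specify; the function names change_speed and speed_perturb are hypothetical and are not taken from the paper. Resampling this way alters both the duration and the pitch of the signal.

```python
def change_speed(samples, factor):
    """Return `samples` resampled so it plays `factor` times faster.

    Assumption: plain linear interpolation between neighbouring samples.
    A production system would normally use a proper band-limited resampler.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_in = len(samples)
    n_out = int(round(n_in / factor))  # faster speed -> fewer output samples
    out = []
    for i in range(n_out):
        pos = i * factor               # fractional position in the input
        lo = int(pos)
        if lo >= n_in - 1:
            # Past the last interpolation interval: hold the final sample.
            out.append(float(samples[n_in - 1]))
            continue
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[lo + 1])
    return out


def speed_perturb(samples, factors=(0.9, 1.0, 1.1)):
    """Produce one version of the signal per speed factor."""
    return {f: change_speed(samples, f) for f in factors}


# Example: one second of a toy signal at an assumed 16 kHz sampling rate.
signal = [float(t % 100) for t in range(16000)]
versions = speed_perturb(signal)
for f in sorted(versions):
    print(f, len(versions[f]))
```

With these factors the training set grows to three times its original size: the 0.9 version is longer than the original, the 1.1 version is shorter, and the 1.0 version is the unmodified signal.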

Bibliographic reference. Ko, Tom / Peddinti, Vijayaditya / Povey, Daniel / Khudanpur, Sanjeev (2015): "Audio augmentation for speech recognition", in INTERSPEECH-2015, 3586-3589.