15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Data Augmentation, Feature Combination, and Multilingual Neural Networks to Improve ASR and KWS Performance for Low-Resource Languages

Zoltán Tüske, Pavel Golik, David Nolden, Ralf Schlüter, Hermann Ney

RWTH Aachen University, Germany

This paper presents progress on acoustic models for low-resource languages (Assamese, Bengali, Haitian Creole, Lao, Zulu) developed within the second evaluation campaign of the IARPA Babel project. This year, the project focuses on training high-performing automatic speech recognition (ASR) and keyword search (KWS) systems from language resources limited to about 10 hours of transcribed speech data. Optimizing the structure of the Multilayer Perceptron (MLP) based feature extraction and switching from the sigmoid activation function to rectified linear units result in about 5% relative improvement over baseline MLP features. Further improvements are obtained by training the MLPs on multiple feature streams and by exploiting label-preserving data augmentation techniques such as vocal tract length perturbation. Systematic application of these methods improves the unilingual systems by 4–6% absolute in word error rate (WER) and 0.064–0.105 absolute in maximum term-weighted value (MTWV). Transfer and adaptation of multilingually trained MLPs lead to additional gains, clearly exceeding the project goal of 0.3 MTWV even when only the limited language pack of the target language is used.
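As an illustration of the label-preserving augmentation mentioned in the abstract, the following is a minimal sketch of a piecewise-linear frequency warp of the kind commonly used for vocal tract length perturbation. The abstract does not specify the authors' warping function, so the function names (`vtlp_warp`, `sample_alpha`), the 8 kHz sample rate, the 3400 Hz boundary frequency, and the warp-factor range of 0.9 to 1.1 are all assumptions chosen for illustration, not values taken from the paper.

```python
import random


def vtlp_warp(freq_hz, alpha, sample_rate=8000.0, f_hi=3400.0):
    """Warp one frequency with a piecewise-linear function (illustrative).

    alpha is the warp factor (1.0 means no warping). sample_rate and f_hi
    are assumed values, not taken from the paper; f_hi must lie below the
    Nyquist frequency.
    """
    nyquist = sample_rate / 2.0
    # Below this boundary the warp is a plain scaling by alpha.
    boundary = f_hi * min(alpha, 1.0) / alpha
    if freq_hz <= boundary:
        return freq_hz * alpha
    # Above the boundary, a second linear segment keeps the warp
    # continuous and maps the Nyquist frequency onto itself.
    slope = (nyquist - f_hi * min(alpha, 1.0)) / (nyquist - boundary)
    return nyquist - slope * (nyquist - freq_hz)


def sample_alpha(rng, low=0.9, high=1.1):
    """Draw a random warp factor; the range is a hypothetical choice."""
    return rng.uniform(low, high)


if __name__ == "__main__":
    rng = random.Random(0)
    alpha = sample_alpha(rng)
    # Warp a few example filterbank centre frequencies for one utterance.
    centres = [250.0, 1000.0, 2000.0, 3500.0]
    warped = [vtlp_warp(f, alpha) for f in centres]
    print(alpha, warped)
```

In a setup like this, a new warp factor would typically be drawn per utterance and applied to the filterbank centre frequencies before feature extraction, so that each warped copy of the audio keeps its original transcription and alignment.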


Bibliographic reference. Tüske, Zoltán / Golik, Pavel / Nolden, David / Schlüter, Ralf / Ney, Hermann (2014): "Data augmentation, feature combination, and multilingual neural networks to improve ASR and KWS performance for low-resource languages", in INTERSPEECH-2014, 1420-1424.