16th Annual Conference of the International Speech Communication Association

Dresden, Germany
September 6-10, 2015

Active Learning Based Data Selection for Limited Resource STT and KWS

Thiago Fraga-Silva (1), Jean-Luc Gauvain (2), Lori Lamel (2), Antoine Laurent (1), Viet-Bac Le (1), Abdel Messaoudi (1)

(1) Vocapia Research, France
(2) LIMSI, France

This paper presents first results on using active learning (AL) for training data selection in the context of the IARPA-Babel program. Given an initial training data set, we aim to automatically select additional data from an untranscribed pool for manual transcription. The initial and selected data are then used to build acoustic and language models for speech recognition. The goal of the AL task is to outperform a baseline system built from a pre-defined selection of the same amount of data, namely the Very Limited Language Pack (VLLP) condition. AL methods based on different selection criteria have been explored. Compared to the VLLP baseline, improvements are obtained in terms of Word Error Rate (WER) and Actual Term Weighted Value (ATWV) for the Lithuanian language. A description of the methods and an analysis of the results are given. The AL selection also outperforms the VLLP baseline for other IARPA-Babel languages, and will be further tested in the upcoming NIST OpenKWS 2015 evaluation.
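As an illustration of the pool-based selection task described above, the following minimal sketch picks untranscribed utterances for manual transcription under a fixed duration budget. It is not the method of the paper: the abstract does not specify the selection criteria, so the lowest-confidence-first rule, the `Utterance` fields, the function name `select_for_transcription`, and the toy data are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    utt_id: str
    duration_s: float   # audio duration in seconds
    confidence: float   # hypothetical mean word confidence from the initial system, in [0, 1]


def select_for_transcription(pool, budget_s):
    """Greedily pick the least confident utterances that fit in the budget.

    Assumed criterion: low recognizer confidence marks data the current
    models handle poorly, so transcribing it may be more informative.
    """
    selected = []
    used_s = 0.0
    for utt in sorted(pool, key=lambda u: u.confidence):
        # Skip utterances that would exceed the transcription budget.
        if used_s + utt.duration_s <= budget_s:
            selected.append(utt)
            used_s += utt.duration_s
    return selected


# Toy untranscribed pool (made-up values, for illustration only).
pool = [
    Utterance("a", 4.0, 0.9),
    Utterance("b", 6.0, 0.4),
    Utterance("c", 5.0, 0.6),
    Utterance("d", 3.0, 0.2),
]
chosen = select_for_transcription(pool, budget_s=10.0)
print([u.utt_id for u in chosen])
```

In the setting of the abstract, the selected utterances would be manually transcribed and added to the initial training set, keeping the total amount of data equal to that of the baseline condition.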


Bibliographic reference. Fraga-Silva, Thiago / Gauvain, Jean-Luc / Lamel, Lori / Laurent, Antoine / Le, Viet-Bac / Messaoudi, Abdel (2015): "Active learning based data selection for limited resource STT and KWS", in INTERSPEECH-2015, 3159-3163.