INTERSPEECH 2011
12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

Optimal Selection of Limited Vocabulary Speech Corpora

Hui Lin, Jeff Bilmes

University of Washington, USA

We address the problem of finding a subset of a large speech corpus that is useful for accurately and rapidly prototyping novel and computationally expensive speech recognition architectures. We express this as an optimization problem over submodular functions. Quantities such as the vocabulary size (or quality) of a set of utterances, or the quality of a bundle of word types, are submodular functions, which makes finding optimal solutions possible. Moreover, we are able to express our approach using graph cuts, leading to a very fast implementation even on large initial corpora. We show results on the Switchboard-I corpus, demonstrating improvements over previous techniques for this purpose. We also demonstrate the variety of corpora that can be produced using our method.
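As an illustration of why vocabulary size is submodular, the following minimal sketch counts the distinct word types in a set of utterances and checks the diminishing-returns property by brute force. The toy corpus and the names `CORPUS`, `vocab_size`, `gain`, and `is_submodular` are assumptions made for this example; it is not the graph-cut formulation described in the paper.

```python
from itertools import combinations

# Hypothetical toy corpus: utterance id -> word tokens.
CORPUS = {
    "u1": ["yeah", "right"],
    "u2": ["yeah", "i", "know"],
    "u3": ["i", "know", "right"],
    "u4": ["well", "okay"],
}


def vocab_size(selected):
    """Number of distinct word types in the selected utterances."""
    words = set()
    for utt in selected:
        words.update(CORPUS[utt])
    return len(words)


def gain(utt, selected):
    """Marginal increase in vocabulary size from adding one utterance."""
    return vocab_size(set(selected) | {utt}) - vocab_size(selected)


def is_submodular():
    """Check gain(u, A) >= gain(u, B) for every A subset of B and u not in B."""
    utts = list(CORPUS)
    subsets = [set(c) for r in range(len(utts) + 1)
               for c in combinations(utts, r)]
    for b in subsets:
        for a in subsets:
            if not a <= b:
                continue
            for u in utts:
                if u not in b and gain(u, a) < gain(u, b):
                    return False
    return True
```

Here adding `u3` to an empty selection introduces three new word types, while adding it after `u1` and `u2` introduces none, which is the diminishing-returns behavior that characterizes a submodular function.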


Bibliographic reference. Lin, Hui / Bilmes, Jeff (2011): "Optimal selection of limited vocabulary speech corpora", In INTERSPEECH-2011, 1489-1492.