This paper presents a data selection approach in which spoken utterances are selected sequentially from a large out-of-domain data set to match the utterance distribution of an in-domain data set. We propose to represent each utterance by its iVector, a low-dimensional vector giving the coordinates of that utterance in a subspace acoustic model. We show that the distribution of iVectors can characterize a data set and makes it possible to distinguish subsets of utterances from different domains. Finally, we present experimental speech recognition results for a system trained on a data set constructed by the proposed algorithm, and compare them with random data selection.
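The selection idea can be illustrated with a minimal sketch. It is not the authors' algorithm, whose details the abstract does not give. It assumes that iVectors are already available as NumPy arrays, that each data set is summarized by a histogram of its iVectors over a fixed set of centroids, and that utterances are picked one at a time from whichever cluster is currently most under-represented relative to the in-domain histogram. The names `assign_to_centroids` and `select_matching`, the centroids, and the `budget` parameter are all hypothetical.

```python
import numpy as np


def assign_to_centroids(vectors, centroids):
    """Return, for each row of `vectors`, the index of its nearest centroid."""
    diffs = vectors[:, None, :] - centroids[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)


def select_matching(in_domain, out_of_domain, centroids, budget):
    """Greedily pick out-of-domain utterances so that their cluster histogram
    tracks the in-domain one (an assumed criterion, for illustration only)."""
    k = len(centroids)
    in_labels = assign_to_centroids(in_domain, centroids)
    # Target distribution: fraction of in-domain utterances in each cluster.
    target = np.bincount(in_labels, minlength=k) / len(in_domain)

    out_labels = assign_to_centroids(out_of_domain, centroids)
    # Unused out-of-domain utterance indices per cluster, in original order.
    pools = [np.flatnonzero(out_labels == c).tolist() for c in range(k)]

    counts = np.zeros(k)
    selected = []
    while len(selected) < budget:
        # How far each cluster would fall short of the target after one more pick.
        deficit = target * (len(selected) + 1) - counts
        neediest = int(np.argmax(deficit))
        if not pools[neediest]:
            break  # stop rather than drift away from the target distribution
        selected.append(pools[neediest].pop(0))
        counts[neediest] += 1
    return selected


# Toy usage with 2-dimensional stand-ins for iVectors.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
in_domain = np.array([[0, 0], [0, 1], [1, 0], [10, 10]], dtype=float)
out_of_domain = np.array(
    [[10, 9], [9, 10], [0, 0], [1, 1], [0, 1], [10, 10], [1, 0], [0, 0]],
    dtype=float,
)
chosen = select_matching(in_domain, out_of_domain, centroids, budget=4)
print(sorted(chosen))
```

In this toy case, three quarters of the in-domain vectors fall in the first cluster, so the four selected utterances split three to one between the clusters. A random-selection baseline would simply sample `budget` indices uniformly from the out-of-domain set.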
Bibliographic reference. Siohan, Olivier / Bacchiani, Michiel (2013): "iVector-based acoustic data selection", In INTERSPEECH-2013, 657-661.