This paper describes a novel approach to automatically selecting training sentences from a system-generated data feed, for developing the high-precision language models (LMs) required by speech-enabled voice interface applications in the TV search domain. We develop a set of heuristic rules that select training sentences directly from the TV electronic programming guide (EPG) in its metadata form. A training corpus constructed by applying these selection rules to historical EPG data yields adapted LMs with considerably lower perplexity and a significant reduction in word error rate (WER). When evaluated on user-generated spoken queries to an experimental TV search application, the adapted LMs achieve a 10% absolute reduction in WER over baseline LMs created without the training sentences generated from the historical EPG data.
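For readers unfamiliar with the metric, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The function name `word_error_rate` and the sample query are illustrative assumptions, not taken from the paper, and the paper's own scoring tool may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is plain
    whitespace splitting, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far (excluding the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up spoken query: one substituted word out of five reference words.
print(word_error_rate("watch the big bang theory", "watch the big band theory"))
```

An absolute reduction is a difference between two such rates expressed in percentage points, so it is a stronger claim than a relative reduction of the same numeric size.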
Bibliographic reference. Chang, Harry M. (2013): "Heuristic selection of training sentences from historical TV guide for semi-supervised LM adaptation", In INTERSPEECH-2013, 2227-2231.