5th International Conference on Spoken Language Processing
The purpose of this research is to investigate methods for applying speech recognition techniques to improve the productivity of off-line captioning for television. We posit that existing corpora for training continuous speech recognisers are unrepresentative of the acoustic conditions of television soundtracks. To evaluate the use of application-specific models for this task we have developed a soundtrack corpus (representing a single genre of television programming) for acoustic analysis and a text corpus (from the same genre) for language modelling. These corpora are built from components of the manual captioning process. Captions were used to automatically segment and label the acoustic soundtrack data at sentence level, with manual post-processing to classify and verify the data. The text corpus was derived by automatic processing of approximately 1 million words of caption text. The results confirm the acoustic profile of the task to be characteristically different from that of most other speech recognition tasks, with the soundtrack corpus being almost devoid of clean speech. The text corpus indicates that application-specific language modelling will be effective for the chosen genre, although a lexicon providing complete lexical coverage is unattainable. There is a high correspondence between captions and soundtrack speech for the chosen genre, confirming that closed captions can be a useful data source for generating labelled acoustic data. The corpora provide a high-quality resource to support further research into automated speech recognition.
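The caption-driven segmentation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each closed caption carries onset and offset timestamps, and the names (`Caption`, `segment_by_captions`) are hypothetical.

```python
# Hypothetical sketch: map timestamped captions to sentence-level
# segment boundaries in a soundtrack, labelling each segment with
# its caption text (manual verification would follow, as in the paper).
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Caption:
    start_s: float  # caption onset, in seconds
    end_s: float    # caption offset, in seconds
    text: str       # caption text, used as the segment label


def segment_by_captions(captions: List[Caption],
                        sample_rate: int) -> List[Tuple[int, int, str]]:
    """Convert each caption into a (start_sample, end_sample, label)
    triple for cutting labelled sentence-level audio segments."""
    segments = []
    for cap in captions:
        start = int(cap.start_s * sample_rate)
        end = int(cap.end_s * sample_rate)
        if end > start:  # skip degenerate or mis-timed captions
            segments.append((start, end, cap.text))
    return segments


caps = [Caption(0.0, 2.5, "Good evening."),
        Caption(2.5, 5.0, "Here is the news.")]
segs = segment_by_captions(caps, 16000)
```

In practice caption timing only approximates the spoken audio, which is why the corpus construction above includes a manual post-processing pass to classify and verify each segment.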
Bibliographic reference. Ahmer, Ingrid / King, Robin W. (1998): "Automated captioning of television programs: development and analysis of a soundtrack corpus", In ICSLP-1998, paper 0419.