Sixth International Conference on Spoken Language Processing
In this paper, we introduce the problem of audio stream phrase recognition for information retrieval for a new National Gallery of the Spoken Word (NGSW). This will be the first large-scale repository of its kind, consisting of speeches, news broadcasts, and recordings of historical significance from the 20th century. We propose a system diagram and discuss critical processing tasks: an environment classifier; recognizer model adaptation for acoustic background noise, restricted channels, and speaker variability; a natural language processor; and speech enhancement/feature processing. A probe NGSW data set is used to perform experiments with the SPHINX-III LVCSR system and a previously formulated RSPL keyword spotting system. Results are reported for the WSJ, BN, and NGSW corpora. Results from sub-system evaluations are reported for (i) model adaptation based on mixture weight adjustment with MLLR (which reduces word error rate (WER) by 2.6% over a baseline BN-trained model), (ii) speaker and environmental turn taking using the Bayesian Information Criterion (BIC), and (iii) statistical analysis of phrase recognition performance for confidence measure scoring. Finally, we discuss a number of research challenges that must be addressed for robust phrase searching in unrestricted corpora.
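As an illustration of the BIC-based turn detection mentioned above, the following is a minimal sketch of a single change-point test. It is not taken from the paper: it assumes the commonly used delta-BIC formulation in which each segment is modeled by one Gaussian, and it works on a one-dimensional feature sequence for brevity, whereas a practical system would typically use multi-dimensional cepstral features with full covariance matrices. The function names and the `lam` and `margin` parameters are hypothetical.

```python
import math


def _log_var(xs):
    """Log of the maximum-likelihood variance of xs (floored to avoid log(0))."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return math.log(max(var, 1e-12))


def delta_bic(xs, i, lam=1.0):
    """Delta-BIC for splitting xs at index i (assumed 1-D, single-Gaussian models).

    A positive value favors two separate Gaussians (a turn at i) over one.
    """
    n = len(xs)
    # Log-likelihood gain from modeling the two halves separately.
    gain = 0.5 * (n * _log_var(xs)
                  - i * _log_var(xs[:i])
                  - (n - i) * _log_var(xs[i:]))
    # Penalty 0.5 * (d + d(d+1)/2) * log(n) with d = 1 extra mean and variance,
    # scaled by the hypothetical tuning weight lam.
    penalty = lam * math.log(n)
    return gain - penalty


def best_change_point(xs, margin=5, lam=1.0):
    """Return the split index with the largest positive delta-BIC, or None."""
    best_i, best_score = None, 0.0
    for i in range(margin, len(xs) - margin + 1):
        score = delta_bic(xs, i, lam)
        if score > best_score:
            best_i, best_score = i, score
    return best_i


# Synthetic example: the signal level shifts halfway through the window.
signal = ([0.1 * (-1) ** k for k in range(50)]
          + [5.0 + 0.1 * (-1) ** k for k in range(50)])
print(best_change_point(signal))
```

In this formulation the penalty weight trades missed turns against false alarms, and the `margin` keeps each candidate segment long enough for a meaningful variance estimate.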
Bibliographic reference. Hansen, John H. L. / Zhou, Bowen / Akbacak, Murat / Sarikaya, Ruhi / Pellom, Bryan (2000): "Audio stream phrase recognition for a national gallery of the spoken word: 'one small step'", in ICSLP-2000, vol. 3, 1089-1092.