This paper presents a new out-of-vocabulary (OOV) word learning approach that dynamically extends the pronunciation lexicon and the language model for large vocabulary continuous speech recognition (LVCSR) in spoken document retrieval (SDR) systems. Based on the assumption that the graphemes and the n-gram statistics of OOV words can be effectively learned from other contemporary or in-domain text documents, the proposed approach is an iterative procedure of dynamic unsupervised new word learning. It uses relevant text documents (termed comparable texts), retrieved from an external resource such as a special-domain text database or the Internet, as the lexicon/language model (LM) adaptation data. Preliminary experiments were conducted on the Hub-4 96 English broadcast news development set (F0 condition only), using the TREC-2001 WebTrack data (WT10g) as the external resource. The results showed that, even without any key term selection or new word extraction/filtering techniques, the proposed framework significantly reduced the OOV rates of various artificially created lexicons, from 2.64%, 5.18%, and 10.66% to 1.83%, 2.93%, and 4.58%, respectively.
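The iterative procedure and the OOV-rate metric described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the word-overlap retrieval, and the toy data are all assumptions made for illustration. In particular, the recognizer is replaced by a stand-in that simply drops words missing from the lexicon, and every word of a retrieved document is added to the lexicon, mirroring the setting in which no key term selection or new word filtering is applied.

```python
from typing import Iterable, List, Set


def oov_rate(tokens: Iterable[str], lexicon: Set[str]) -> float:
    """Fraction of running word tokens that the lexicon does not cover."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in lexicon)
    return missing / len(tokens)


def retrieve_by_overlap(query: List[str], corpus: List[List[str]],
                        top_k: int = 1) -> List[List[str]]:
    """Hypothetical retrieval step: rank documents by word overlap with the query."""
    query_set = set(query)
    ranked = sorted(corpus, key=lambda doc: len(query_set & set(doc)), reverse=True)
    return ranked[:top_k]


def learn_new_words(reference: List[str], lexicon: Set[str],
                    corpus: List[List[str]], iterations: int = 2,
                    top_k: int = 1) -> Set[str]:
    """Iteratively extend the lexicon with words from retrieved comparable texts."""
    lexicon = set(lexicon)  # work on a copy
    for _ in range(iterations):
        # Stand-in for recognition (assumption): only in-lexicon words survive.
        hypothesis = [t for t in reference if t in lexicon]
        for doc in retrieve_by_overlap(hypothesis, corpus, top_k):
            # No filtering: every word of a retrieved document is added.
            lexicon.update(doc)
    return lexicon


# Toy data, invented for illustration only.
reference = "the senator visited kosovo after the nato summit".split()
start_lexicon = {"the", "senator", "visited", "after", "summit"}
corpus = [
    "nato summit ends as senator visited kosovo".split(),
    "stock markets rallied after the earnings report".split(),
]

before = oov_rate(reference, start_lexicon)
extended = learn_new_words(reference, start_lexicon, corpus)
after = oov_rate(reference, extended)
print(f"OOV rate before: {before:.2%}, after: {after:.2%}")
```

In this toy run the two OOV tokens ("kosovo", "nato") are recovered from the top-ranked document. A real system would also need pronunciations for the new words and would update the language model from the retrieved texts, which the sketch omits.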
Cite as: Chen, K.-T., Chuang, S.-L., Seide, F., Wang, H.-M., Chien, L.-F., Chang, E. (2003) New word learning for spoken document processing through discovery of comparable texts from external resources. Proc. ISCA Workshop on Multilingual Spoken Document Retrieval (MSDR 2003), 79-84
@inproceedings{chen03_msdr,
  author={Kuan-Ting Chen and Shui-Lung Chuang and Frank Seide and Hsin-Min Wang and Lee-Feng Chien and Eric Chang},
  title={{New word learning for spoken document processing through discovery of comparable texts from external resources}},
  year=2003,
  booktitle={Proc. ISCA Workshop on Multilingual Spoken Document Retrieval (MSDR 2003)},
  pages={79--84}
}