2003 ISCA Workshop on
Multilingual Spoken Document Retrieval
(MSDR2003)
Hong Kong
April 4-5, 2003
New Word Learning for Spoken Document Processing through Discovery of Comparable Texts from External Resources
Kuan-Ting Chen (1,3), Shui-Lung Chuang (1), Frank Seide (2), Hsin-Min Wang (1), Lee-Feng Chien (1), Eric Chang (2)
(1) Institute of Information Science, Academia Sinica, Taipei, Taiwan
(2) Microsoft Research Asia, Beijing, China
(3) Graduate Institute of Communication Engineering, National Taiwan University, Taipei, Taiwan
This paper presents a new out-of-vocabulary (OOV)
word learning approach that dynamically extends the
pronunciation lexicon and the language model for
large vocabulary continuous speech recognition
(LVCSR) in spoken document retrieval (SDR)
systems. Based on the assumption that the
graphemes as well as the n-gram statistics of the
OOV words can be effectively learned from other
contemporary or in-domain text documents, the
proposed approach suggests an iterative procedure of
dynamic unsupervised new word learning, which
makes use of relevant text documents (termed
comparable texts) retrieved from an external
resource, such as special-domain text databases or
the Internet, as the lexicon/language model (LM)
adaptation data. The preliminary experiments were
conducted on the Hub-4 ’96 English broadcast news
development set (F0 condition only), using
TREC-2001 WebTrack data (WT10g) as the external
resource. The results showed that, when neither
key term selection nor new word extraction/filtering
techniques were applied, the proposed framework
significantly reduced the OOV rates of various
artificially created lexicons, from 2.64%,
5.18%, and 10.66% to 1.83%, 2.93%, and 4.58%,
respectively.
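
The iterative procedure and the OOV-rate measure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (oov_rate, retrieve_comparable, learn_new_words), the word-overlap retrieval, and the simulation of the recognizer by simply dropping OOV words are all assumptions made for illustration. As in the reported experiments, no key term selection or new word filtering is applied, so every word of a retrieved text is added to the lexicon.

```python
from collections import Counter


def oov_rate(tokens, lexicon):
    """Percentage of running word tokens that are missing from the lexicon."""
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in lexicon)
    return 100.0 * missing / len(tokens)


def retrieve_comparable(query_tokens, corpus, top_k=2):
    """Toy retrieval (an assumption): rank documents by word overlap with the query."""
    query = Counter(query_tokens)
    scored = []
    for doc in corpus:
        overlap = sum(min(query[w], c) for w, c in Counter(doc).items())
        scored.append((overlap, doc))
    # Sort on the score only, so ties keep their corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for overlap, doc in scored[:top_k] if overlap > 0]


def learn_new_words(reference, lexicon, corpus, iterations=2, top_k=2):
    """Iteratively extend the lexicon with words from retrieved comparable texts."""
    lexicon = set(lexicon)
    for _ in range(iterations):
        # Stand-in for a recognition pass: only in-vocabulary words can be output.
        hypothesis = [t for t in reference if t in lexicon]
        for doc in retrieve_comparable(hypothesis, corpus, top_k):
            # No key term selection or filtering: add every word of the document.
            lexicon.update(doc)
    return lexicon


if __name__ == "__main__":
    reference = "the senate passed the telecom bill after a filibuster".split()
    lexicon = {"the", "senate", "passed", "bill", "after", "a"}
    corpus = [
        "senate passed the telecom bill".split(),
        "a filibuster delayed the senate vote".split(),
        "stock markets rallied on friday".split(),
    ]
    print("OOV rate before:", oov_rate(reference, lexicon))
    updated = learn_new_words(reference, lexicon, corpus)
    print("OOV rate after: ", oov_rate(reference, updated))
```

In a real system each iteration would re-run LVCSR with the extended lexicon and an adapted language model; the sketch only tracks lexicon coverage.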
Bibliographic reference.
Chen, Kuan-Ting / Chuang, Shui-Lung / Seide, Frank / Wang, Hsin-Min / Chien, Lee-Feng / Chang, Eric (2003):
"New word learning for spoken document processing through discovery of comparable texts from external resources",
In MSDR-2003, 79-84.