In cross-language spoken document retrieval, potentially errorful translations of a source language query must be matched against potentially errorful automatic speech recognition transcriptions of spoken documents. Document expansion, which uses pseudo-relevance feedback to enrich the original transcript with related selective terms, can help to recover matches lost through mistranscription or absent from translation. In this paper we compare three multi-scale strategies for unit selection in different phases of the document expansion and retrieval process on Mandarin Chinese documents: character bigrams, words, and a hybrid strategy combining bigrams and words. We find that the hybrid bigram-word strategy, which uses bigrams to enhance recall and identifies highly selective words to enhance precision for expansion, results in the greatest, highly significant improvement over unexpanded documents, and additionally outperforms retrieval on perfect manual transcriptions.
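As a rough illustration of the three unit types compared above, the following sketch derives overlapping character bigrams, words, and a combined bigram-plus-word set from a short Mandarin string. The function names `char_bigrams` and `hybrid_units` are hypothetical, the word segmentation is assumed to be supplied by an external segmenter, and the simple concatenation shown here is an assumption for illustration only, not the procedure used in the paper.

```python
def char_bigrams(text: str) -> list[str]:
    """Return overlapping character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def hybrid_units(text: str, segmented_words: list[str]) -> list[str]:
    """Combine bigrams with multi-character words (illustrative assumption).

    Bigrams give broad coverage of the text; the words are kept as
    additional, more selective units. `segmented_words` is assumed to
    come from an external word segmenter.
    """
    words = [w for w in segmented_words if len(w) > 1]
    return char_bigrams(text) + words


# Hypothetical example: a four-character string segmented into two words.
bigrams = char_bigrams("北京大学")
units = hybrid_units("北京大学", ["北京", "大学"])
```

Overlapping bigrams such as the middle pair in this example span a word boundary, which is one reason bigram units tend to favor recall while word units are more selective.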
Cite as: Levov, G.-A. (2003) Multi-scale document expansion for Mandarin Chinese. Proc. ISCA Workshop on Multilingual Spoken Document Retrieval (MSDR 2003), 73-78
@inproceedings{levov03_msdr,
  author={Gina-Anne Levov},
  title={{Multi-scale document expansion for Mandarin Chinese}},
  year=2003,
  booktitle={Proc. ISCA Workshop on Multilingual Spoken Document Retrieval (MSDR 2003)},
  pages={73--78}
}