ISCA Archive ISCSLP 2002

A data-driven indexing approach for Chinese spoken document retrieval

Chun-Jen Wang, Berlin Chen, Lin-Shan Lee

The choice of indexing features is critical to the performance of a retrieval system. Predefined, overlapping, fixed-length term sequences are widely used in many retrieval systems. However, predefined feature sets are often riddled with meaningless and non-informative terms, which unavoidably degrade retrieval performance and inflate the feature set. In this paper we present a statistical approach to derive data-driven term segments as features. We let the data tell which features are important and which are not. The results show that very satisfactory performance can be achieved with these data-driven indexing features while keeping the feature set very compact. This approach also has the potential to identify domain-specific terminology or newly coined phrases.
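As an illustration of the baseline that the abstract contrasts itself with, the following minimal sketch enumerates predefined, overlapping, fixed-length term sequences over a tokenized document. The syllable-like tokens, the function names, and the `max_n` parameter are assumptions made for this example only; the statistical procedure the paper uses to derive data-driven segments is not described in the abstract and is not reproduced here.

```python
from collections import Counter


def overlapping_ngrams(tokens, n):
    """Return every overlapping window of length n over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def index_features(tokens, max_n=3):
    """Count all overlapping fixed-length sequences of length 1..max_n.

    This is the 'predefined' feature set: every window is kept, whether or
    not it is informative, so the set grows quickly with max_n.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(overlapping_ngrams(tokens, n))
    return counts


# Hypothetical toy document (invented tokens, not data from the paper).
doc = ["tai", "wan", "da", "xue", "tai", "wan"]
features = index_features(doc, max_n=2)
```

Every window becomes a feature here, including ones that span unrelated terms such as `("xue", "tai")`. This is the kind of non-informative entry that a data-driven selection step would be expected to drop.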


Cite as: Wang, C.-J., Chen, B., Lee, L.-S. (2002) A data-driven indexing approach for Chinese spoken document retrieval. Proc. International Symposium on Chinese Spoken Language Processing, paper 122

@inproceedings{wang02e_iscslp,
  author={Chun-Jen Wang and Berlin Chen and Lin-Shan Lee},
  title={{A data-driven indexing approach for Chinese spoken document retrieval}},
  year=2002,
  booktitle={Proc. International Symposium on Chinese Spoken Language Processing},
  pages={paper 122}
}