International Symposium on Chinese Spoken Language Processing (ISCSLP 2002)

Taipei, Taiwan
August 23-24, 2002

A Data-driven Indexing Approach for Chinese Spoken Document Retrieval

Chun-Jen Wang, Berlin Chen, Lin-Shan Lee

National Taiwan University, Taipei, Taiwan

The choice of indexing features is critical to the performance of a retrieval system. Predefined, overlapping, fixed-length term sequences are widely used in many retrieval systems. However, predefined feature sets are often riddled with meaningless, non-informative terms, which inevitably degrade retrieval performance and inflate the feature set. In this paper we present a statistical approach to deriving data-driven term segments as features. We let the data tell us which features are important and which are not. The results show that these data-driven indexing features achieve very satisfactory retrieval performance while keeping the feature set very compact. This approach also has the potential to identify domain-specific terminology and newly coined phrases.

Bibliographic reference. WANG, Chun-Jen / CHEN, Berlin / LEE, Lin-Shan (2002): "A data-driven indexing approach for Chinese spoken document retrieval", in ISCSLP 2002, paper 122.