8th European Conference on Speech Communication and Technology

Geneva, Switzerland
September 1-4, 2003


Using Syllable-Based Indexing Features and Language Models to Improve German Spoken Document Retrieval

Martha Larson, Stefan Eickeler

Fraunhofer Institute for Media Communication, Germany

Spoken document collections with high word-type/word-token ratios and heterogeneous audio continue to pose a challenge for information retrieval. The experimental results reported in this paper demonstrate that syllable-based indexing features can outperform word-based indexing features on such a domain, and that syllable-based speech recognition language models can successfully be used to generate syllable-based indexing features. Recognition is carried out with a 5k syllable language model and a 10k mixed-unit language model whose vocabulary consists of a mixture of words and syllables. Both language models yield retrieval performance comparable to that attained with a large-vocabulary word-based language model. Experiments are performed on a spoken document collection consisting of short German-language radio documentaries. First, the vector space model is applied to a known-item retrieval task and a similar-document search. Then, the known-item retrieval task is further explored with a Levenshtein-distance-based fuzzy word match.
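The Levenshtein-distance-based fuzzy match mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the match is scored, so the length normalization, the `max_ratio` threshold, and the function names below are assumptions made for illustration, not the authors' method.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of syllables)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_match(query_term, transcript_terms, max_ratio=0.2):
    """Return transcript terms close to query_term.

    Assumption: a term matches when its edit distance, divided by the
    length of the longer of the two terms, is at most max_ratio.
    """
    hits = []
    for term in transcript_terms:
        longest = max(len(query_term), len(term))
        if longest == 0:
            continue  # skip empty-vs-empty comparisons
        if levenshtein(query_term, term) / longest <= max_ratio:
            hits.append(term)
    return hits
```

Because `levenshtein` accepts any sequence, the same function can compare strings character by character or compare lists of syllables unit by unit, for example `fuzzy_match("bundestag", ["bundestak", "bundesrat", "wetter"])`.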

Bibliographic reference. Larson, Martha / Eickeler, Stefan (2003): "Using syllable-based indexing features and language models to improve German spoken document retrieval", in EUROSPEECH-2003, 1217-1220.