ISCA Archive Interspeech 2006

Using latent semantic indexing for morph-based spoken document retrieval

Ville T. Turunen, Mikko Kurimo

Previously, phone-based and word-based approaches have been used for spoken document retrieval. The former suffers from high error rates and the latter from the limited vocabulary of the recognizer. Our method relies on an unlimited-vocabulary continuous speech recognizer that uses morpheme-like units discovered in an unsupervised manner. These morpheme-like units, or "morphs" for short, have also been used successfully as index terms. One problem with using morphs as index terms is that the segmentation does not always separate out the same stem for different inflected forms of the same word. This resembles the problem of synonyms. In this paper, we apply latent semantic indexing to morph-based retrieval. The idea is to project morphs that correspond to the same word, as well as other semantically related terms, to the same dimension. The results show clear improvements in Finnish spoken document retrieval performance.
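The projection idea in the abstract can be illustrated with a minimal sketch, assuming a textbook formulation of latent semantic indexing: a truncated singular value decomposition of a morph-by-document count matrix. This is not the paper's implementation. The morphs (`talo` and `talos` as two stems that a segmenter might hypothetically produce for inflected forms of the same Finnish word, plus `auto` and `autos`), the toy counts, the rank `k=2`, and the helper names `lsi_fit`, `fold_in`, and `cosine` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data: rows are morphs, columns are documents.
# "talo"/"talos" and "auto"/"autos" stand in for different stems that a
# segmenter might produce for inflected forms of the same word.
morphs = ["talo", "talos", "auto", "autos"]
A = np.array([
    [1, 1, 0, 0, 0, 0],  # talo
    [1, 0, 1, 0, 0, 0],  # talos
    [0, 0, 0, 1, 1, 0],  # auto
    [0, 0, 0, 1, 0, 1],  # autos
], dtype=float)


def lsi_fit(matrix, k):
    """Return the top-k left singular vectors and singular values."""
    U, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], s[:k]


def fold_in(vec, U_k, s_k):
    """Project a morph-count vector (query or document) into the latent space."""
    return (U_k.T @ vec) / s_k


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


U_k, s_k = lsi_fit(A, k=2)

query = np.array([1.0, 0.0, 0.0, 0.0])  # query containing only "talo"
doc_talos = A[:, 2]                     # document containing only "talos"
doc_autos = A[:, 5]                     # document containing only "autos"

# Similarity in the original morph space: no shared index terms.
raw_sim = cosine(query, doc_talos)

# Similarity after projecting both vectors into the latent space.
lsi_sim = cosine(fold_in(query, U_k, s_k), fold_in(doc_talos, U_k, s_k))
lsi_sim_unrelated = cosine(fold_in(query, U_k, s_k), fold_in(doc_autos, U_k, s_k))

print(f"raw: {raw_sim:.3f}  lsi: {lsi_sim:.3f}  lsi (unrelated): {lsi_sim_unrelated:.3f}")
```

In this toy setup the query and the `talos` document share no index term, so their similarity in the original space is zero. Because `talo` and `talos` co-occur in one document, the rank-2 projection places them on the same latent dimension and the two vectors become close, while the unrelated `autos` document stays apart. Real collections would need weighting of the counts and a much larger rank, neither of which is specified here.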

doi: 10.21437/Interspeech.2006-117

Cite as: Turunen, V.T., Kurimo, M. (2006) Using latent semantic indexing for morph-based spoken document retrieval. Proc. Interspeech 2006, paper 1220-Mon2WeO.6, doi: 10.21437/Interspeech.2006-117

  author={Ville T. Turunen and Mikko Kurimo},
  title={{Using latent semantic indexing for morph-based spoken document retrieval}},
  booktitle={Proc. Interspeech 2006},
  pages={paper 1220-Mon2WeO.6},