Morph-based spoken document retrieval uses morpheme-like subword units both for language modeling and as index terms. Out-of-vocabulary (OOV) problems are avoided because the morph recognizer can recognize any spoken word as a sequence of subwords. The effect of previously unseen query words (i.e., words that do not occur in the language model training text) is analyzed for Finnish spoken document retrieval, and the performance of the morph-based system is compared to that of a word-based approach. Language models with artificially high OOV query word rates are built, and the results show that morph-based retrieval suffers significantly less from OOV query words than word-based retrieval does. Extracting alternative recognition candidates from confusion networks further improves the results, especially for morph-based retrieval.
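The core idea can be illustrated with a toy sketch. It is not the system evaluated in the paper: the morph lexicon, the greedy segmenter, the example documents, and all names (`MORPHS`, `segment`, `build_index`, `search`) are hypothetical, and the word recognizer is simplified to one that drops OOV words, whereas a real recognizer would substitute in-vocabulary words. The sketch only shows why a query word missing from the word vocabulary can still be matched through subword index terms.

```python
from collections import defaultdict

# Hypothetical toy morph lexicon (assumption); a real system would learn
# its subword units from text rather than use a hand-written set.
MORPHS = {"puhe", "en", "tunnist", "us", "haku", "talo", "ssa", "kone"}

def segment(word, morphs=MORPHS):
    """Greedy longest-match segmentation, falling back to single characters."""
    units, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in morphs:
                units.append(word[i:j])
                i = j
                break
        else:
            # No known morph starts here, so emit one character.
            units.append(word[i])
            i += 1
    return units

def build_index(docs, tokenize):
    """Inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            for term in tokenize(word):
                index[term].add(doc_id)
    return index

def search(index, terms):
    """Return the documents that contain every query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# What was actually said in each (toy) spoken document.
SPOKEN = {"d1": ["puheentunnistus", "talossa"], "d2": ["haku", "kone"]}

# Word recognizer vocabulary: "puheentunnistus" is OOV.
WORD_VOCAB = {"talossa", "haku", "kone"}

# Simplification (assumption): the word recognizer drops OOV words.
word_docs = {d: [w for w in ws if w in WORD_VOCAB] for d, ws in SPOKEN.items()}

word_index = build_index(word_docs, lambda w: [w])
morph_index = build_index(SPOKEN, segment)

query = "puheentunnistus"
word_hits = search(word_index, [query])
morph_hits = search(morph_index, segment(query))
```

Here `word_hits` is empty because the query word never reaches the word index, while `morph_hits` contains `d1` because the query and the document share the same morph sequence.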
Bibliographic reference. Turunen, Ville T. (2008): "Reducing the effect of OOV query words by using morph-based spoken document retrieval", In INTERSPEECH-2008, 2158-2161.