9th Annual Conference of the International Speech Communication Association

Brisbane, Australia
September 22-26, 2008

Open-Vocabulary Spoken-Document Retrieval Based on Query Expansion Using Related Web Documents

Makoto Terao, Takafumi Koshinaka, Shinichi Ando, Ryosuke Isotani, Akitoshi Okumura

NEC Corporation, Japan

This paper proposes a new method for open-vocabulary spoken-document retrieval based on query expansion using related Web documents. A large vocabulary continuous speech recognition (LVCSR) system first transcribes spoken documents into word sequences, which are then segmented into semantically cohesive units (i.e., stories) using a text segmentation technique. Given a text query word, Web documents containing the query word are first retrieved. Each retrieved Web document can be regarded as an expanded form of the original query word. Spoken documents relevant to the query word are then retrieved by searching for stories whose LVCSR transcriptions are similar to the previously obtained Web documents. Experimental results show that the proposed method is quite effective in retrieving spoken documents such as broadcast news programs with out-of-vocabulary (OOV) queries. In addition, the proposed method is also useful for ranking retrieval results with in-vocabulary (IV) queries.
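The retrieval step described above — matching each transcribed story against the Web documents retrieved for the query — might be sketched as follows. This is a minimal illustration assuming a standard TF-IDF / cosine-similarity scoring; the paper's exact similarity measure, data, and function names are not specified here, so everything below is an assumption for illustration only.

```python
# Illustrative sketch (not the paper's exact formulation): rank transcribed
# "stories" by their similarity to Web documents retrieved for a query word.
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute sparse TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency of each term
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rank_stories(web_docs, stories):
    """Rank story indices by each story's best match to any Web document.

    web_docs: tokenized Web documents retrieved for the query word
    stories:  tokenized LVCSR transcriptions of segmented stories
    """
    vecs = tfidf_vectors(web_docs + stories)
    web_vecs, story_vecs = vecs[:len(web_docs)], vecs[len(web_docs):]
    scores = [max(cosine(s, w) for w in web_vecs) for s in story_vecs]
    return sorted(range(len(stories)), key=lambda i: scores[i], reverse=True)
```

Because the match is computed against whole expanded documents rather than the single query word, a story can be retrieved even when the query word itself is out of the LVCSR vocabulary and never appears in the transcript — related in-vocabulary words from the Web documents carry the match.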


Bibliographic reference.  Terao, Makoto / Koshinaka, Takafumi / Ando, Shinichi / Isotani, Ryosuke / Okumura, Akitoshi (2008): "Open-vocabulary spoken-document retrieval based on query expansion using related web documents", In INTERSPEECH-2008, 2171-2174.