This paper proposes a new method for open-vocabulary spoken-document retrieval based on query expansion using related Web documents. A large-vocabulary continuous speech recognition (LVCSR) system first transcribes spoken documents into word sequences, which are then segmented into semantically cohesive units (i.e., stories) using a text segmentation technique. Given a text query word, Web documents containing the query word are retrieved first; each retrieved Web document can be regarded as an expanded form of the original query word. Spoken documents relevant to the query are then retrieved by searching for stories whose LVCSR results are similar to the previously obtained Web documents. Experimental results show that the proposed method is highly effective in retrieving spoken documents, such as broadcast news programs, with out-of-vocabulary (OOV) queries. In addition, the method is also useful for ranking retrieval results with in-vocabulary (IV) queries.
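The retrieval step described above can be sketched as follows. This is not the paper's actual implementation; it is a minimal illustration assuming a bag-of-words TF-IDF representation and cosine similarity, with the Web documents retrieved for the query merged into a single expanded query that is matched against each story's LVCSR transcript.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Smoothed IDF so terms occurring in every document still contribute.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs


def cosine(u, v):
    """Cosine similarity between two sparse vectors (term -> weight dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve(expanded_query_docs, story_transcripts):
    """Rank stories by similarity to the expanded query.

    expanded_query_docs: tokenized Web documents retrieved for the query word.
    story_transcripts:   tokenized LVCSR transcripts of segmented stories.
    Returns (story_index, score) pairs in descending order of score.
    """
    # Merge the retrieved Web documents into one expanded query.
    query = [t for d in expanded_query_docs for t in d]
    vecs = tf_idf_vectors(story_transcripts + [query])
    qvec = vecs[-1]
    scores = [(i, cosine(qvec, v)) for i, v in enumerate(vecs[:-1])]
    return sorted(scores, key=lambda s: -s[1])
```

Because matching is done between the expanded query and the transcript text rather than on the query word itself, an OOV query word that the LVCSR system cannot transcribe can still retrieve relevant stories through the IV words of its related Web documents.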
Bibliographic reference. Terao, Makoto / Koshinaka, Takafumi / Ando, Shinichi / Isotani, Ryosuke / Okumura, Akitoshi (2008): "Open-vocabulary spoken-document retrieval based on query expansion using related web documents", in Proc. INTERSPEECH 2008, pp. 2171-2174.