In this paper, a non-negative matrix factorization (NMF)-based document clustering approach is proposed for the cluster-based language model for spoken document retrieval. The retrieval language model comprises three unigram models: a whole-corpus collection-based unigram, a document-based unigram, and a document-clustering-based unigram, which are combined by double linear interpolation. Document clustering is realized via NMF: each document is assigned to the axis on which it has the maximum projection in the latent semantic space derived by NMF. The initialization of NMF, an important factor in its performance, is based on the results of K-means clustering. Using these approaches, retrieval experiments are conducted on a test collection from the Corpus of Spontaneous Japanese (CSJ). The proposed method significantly outperforms the conventional vector space model (VSM): the maximum improvement in retrieval performance (mean average precision, MAP) exceeds 36%, outstripping the conventional query likelihood model, which yields an improvement of 7.4%. The proposed method also surpasses the K-means clustering method when adequate initialization of NMF is used.
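The combination of the three unigram models and the maximum-projection cluster assignment can be illustrated with a minimal Python sketch. The abstract does not give the exact interpolation formula, so the nesting used here (cluster model smoothed with the corpus model, then document model smoothed with that result) is an assumption, as are the weights `lam` and `beta` and all function names, which are hypothetical.

```python
import math
from collections import Counter


def unigram(tokens):
    """Maximum-likelihood unigram distribution over a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def interpolated_prob(word, p_doc, p_cluster, p_corpus, lam=0.5, beta=0.5):
    """Word probability under two nested linear interpolations.

    Assumed form (not taken from the paper):
      P(w) = lam * P_doc(w)
             + (1 - lam) * (beta * P_cluster(w) + (1 - beta) * P_corpus(w))
    """
    # Inner interpolation: cluster model smoothed with the corpus model.
    p_background = (beta * p_cluster.get(word, 0.0)
                    + (1.0 - beta) * p_corpus.get(word, 0.0))
    # Outer interpolation: document model smoothed with the result.
    return lam * p_doc.get(word, 0.0) + (1.0 - lam) * p_background


def query_log_likelihood(query, p_doc, p_cluster, p_corpus, lam=0.5, beta=0.5):
    """Log-likelihood of the query terms under the interpolated model."""
    score = 0.0
    for w in query:
        p = interpolated_prob(w, p_doc, p_cluster, p_corpus, lam, beta)
        if p == 0.0:
            # The word is unseen in all three models.
            return float("-inf")
        score += math.log(p)
    return score


def assign_cluster(coefficients):
    """Index of the latent axis on which a document has its largest projection.

    `coefficients` is assumed to hold the document's NMF coefficients,
    one non-negative value per latent axis.
    """
    return max(range(len(coefficients)), key=lambda k: coefficients[k])
```

In this sketch, documents would be ranked for a query by `query_log_likelihood`, with `p_cluster` estimated from the documents sharing the cluster index returned by `assign_cluster`. The NMF factorization itself and its K-means-based initialization are not shown.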
Bibliographic reference. Hu, Xinhui / Isotani, Ryosuke / Kawai, Hisashi / Nakamura, Satoshi (2010): "Cluster-based language model for spoken document retrieval using NMF-based document clustering", In INTERSPEECH-2010, 705-708.