A topic detection approach based on a probabilistic framework is proposed to realize topic adaptation of automatic speech recognition (ASR) systems for long speech archives such as meetings. Because topics in such speech are not as clearly defined as they are in news stories, we adopt a probabilistic representation of topics based on probabilistic latent semantic analysis (PLSA). A topical subspace is constructed by PLSA, and speech segments are projected onto this subspace, so that each segment is represented by a vector of the topic probabilities obtained by the projection. Topic detection is performed by clustering these vectors, and topic adaptation is performed by collecting relevant texts according to their similarity in this probabilistic representation. In experimental evaluations, the proposed approach significantly reduced perplexity and out-of-vocabulary rates and was robust against ASR errors.
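The projection and similarity steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the word-given-topic distributions P(w|z) have already been trained, uses the standard PLSA "fold-in" EM update to estimate the topic probabilities of a new segment, and uses cosine similarity as a stand-in for the similarity measure, which the abstract does not specify. The function names `fold_in` and `cosine` and the toy two-topic, four-word model are hypothetical, and the clustering step is omitted.

```python
import numpy as np

def fold_in(counts, p_w_given_z, n_iter=50):
    """Estimate P(z|d) for one segment with P(w|z) held fixed (PLSA fold-in)."""
    n_topics = p_w_given_z.shape[0]
    p_z = np.full(n_topics, 1.0 / n_topics)  # start from a uniform topic mix
    for _ in range(n_iter):
        # E-step: posterior P(z|w,d) is proportional to P(w|z) * P(z|d)
        joint = p_w_given_z * p_z[:, None]                      # shape (K, V)
        post = joint / np.maximum(joint.sum(axis=0, keepdims=True), 1e-12)
        # M-step: P(z|d) is proportional to sum_w n(w) * P(z|w,d)
        p_z = post @ counts
        p_z /= p_z.sum()
    return p_z

def cosine(a, b):
    """Cosine similarity between two topic-probability vectors (assumed measure)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy model (hypothetical): 2 topics over a 4-word vocabulary.
p_w_given_z = np.array([[0.45, 0.45, 0.05, 0.05],
                        [0.05, 0.05, 0.45, 0.45]])

# Word-count vectors for two speech segments and one candidate text.
seg_a = np.array([5.0, 4.0, 0.0, 1.0])
seg_b = np.array([0.0, 1.0, 6.0, 3.0])
text = np.array([3.0, 3.0, 1.0, 0.0])

vec_a = fold_in(seg_a, p_w_given_z)
vec_b = fold_in(seg_b, p_w_given_z)
vec_t = fold_in(text, p_w_given_z)

# The candidate text should be closer to segment A than to segment B.
print(cosine(vec_a, vec_t), cosine(vec_b, vec_t))
```

In this sketch, a text would be collected for adapting the model of a segment when its topic vector is sufficiently similar to that segment's vector; the threshold and the choice of similarity measure are left open here.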
Bibliographic reference. Akita, Yuya / Nemoto, Yusuke / Kawahara, Tatsuya (2007): "PLSA-based topic detection in meetings for adaptation of lexicon and language model", In INTERSPEECH-2007, 602-605.