12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

Probabilistic Latent Semantic Analysis for Broadcast News Story Segmentation

Mimi Lu (1), Cheung-Chi Leung (2), Lei Xie (1), Bin Ma (2), Haizhou Li (2)

(1) Northwestern Polytechnical University, China
(2) A*STAR, Singapore

This paper proposes probabilistic latent semantic analysis (PLSA) for broadcast news (BN) story segmentation. PLSA exploits deeper underlying relations among terms beyond their occurrences, so conceptual matching can replace literal term matching. Unlike text segmentation, lexical-based BN story segmentation has to be carried out over large-vocabulary continuous speech recognition (LVCSR) transcripts, where the incorrect recognition of out-of-vocabulary words inevitably distorts the semantic relations. We use phoneme subwords as the basic term units to address this problem. We integrate a cross-entropy measure with PLSA to capture lexical cohesion and compare its performance with the widely used cosine similarity metric. Furthermore, we evaluate two approaches to story boundary identification, namely TextTiling and dynamic programming (DP). Experimental results show that the PLSA-based methods bring a significant performance boost to story segmentation and that the cross-entropy-based DP approach provides the best performance.
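
As an illustration of the two cohesion measures named in the abstract, the following minimal Python sketch compares cosine similarity and cross entropy between two topic distributions, such as PLSA might assign to adjacent blocks of a transcript. The function names, the smoothing constant `eps`, and the example distributions are assumptions made for illustration only; the abstract does not state how the paper defines or smooths its cross-entropy measure.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_p * norm_q)


def cross_entropy(p, q, eps=1e-12):
    """Cross entropy H(p, q) = -sum_k p_k * log(q_k), in nats.

    eps is a hypothetical smoothing constant that avoids log(0).
    """
    return -sum(a * math.log(b + eps) for a, b in zip(p, q))


if __name__ == "__main__":
    # Hypothetical topic distributions for three adjacent blocks.
    block_a = [0.7, 0.2, 0.1]
    block_b = [0.6, 0.3, 0.1]  # similar topic mix to block_a
    block_c = [0.1, 0.2, 0.7]  # different topic mix

    print("cosine(a, b):", cosine_similarity(block_a, block_b))
    print("cosine(a, c):", cosine_similarity(block_a, block_c))
    print("cross entropy(a, b):", cross_entropy(block_a, block_b))
    print("cross entropy(a, c):", cross_entropy(block_a, block_c))
```

Under these assumptions, a higher cosine value or a lower cross entropy suggests stronger lexical cohesion between two blocks. A boundary detector such as TextTiling or DP would then operate on a sequence of such scores.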

Bibliographic reference. Lu, Mimi / Leung, Cheung-Chi / Xie, Lei / Ma, Bin / Li, Haizhou (2011): "Probabilistic latent semantic analysis for broadcast news story segmentation", in INTERSPEECH-2011, 1109-1112.