11th Annual Conference of the International Speech Communication Association

Makuhari, Chiba, Japan
September 26-30, 2010

Maximum Lexical Cohesion for Fine-Grained News Story Segmentation

Zihan Liu (1), Lei Xie (1), Wei Feng (2)

(1) Northwestern Polytechnical University, China
(2) City University of Hong Kong, China

We propose a maximum lexical cohesion (MLC) approach to news story segmentation. Unlike sentence-dependent lexical methods, our approach detects story boundaries at the finer word/subword granularity, and is therefore better suited to speech recognition transcripts, which have no sentence delimiters. The proposed segmentation goodness measure accounts for both lexical cohesion and a prior preference on story length. We measure the lexical cohesion of a segment by the KL-divergence from its word distribution to an associated piecewise uniform distribution. To account for the uneven contributions of different words to a story, the cohesion measure is further refined by two word weighting schemes: the inverse document frequency (IDF) and a new weighting method called difference from expectation (DFE). We then propose a dynamic programming solution that exactly maximizes the segmentation goodness and efficiently locates story boundaries in polynomial time. Experimental results show that our MLC approach outperforms several state-of-the-art lexical methods.
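
The general shape of such a method can be illustrated with a minimal sketch. It is not the paper's formulation: the abstract does not define the piecewise uniform distribution, the length prior, or the DFE weights, so the sketch assumes a plain uniform distribution over the vocabulary as the reference, a Gaussian-shaped log prior on segment length, and no word weighting. The names cohesion, segment_score, segment_tokens, mean_len and sigma are hypothetical.

```python
import math
from collections import Counter


def cohesion(words, vocab_size):
    """KL divergence (in nats) from the segment's word distribution to a
    uniform distribution over vocab_size words.
    Assumption: the uniform reference stands in for the paper's
    piecewise uniform distribution, which the abstract does not define."""
    n = len(words)
    return sum((c / n) * math.log((c / n) * vocab_size)
               for c in Counter(words).values())


def segment_score(words, vocab_size, mean_len, sigma):
    """Length-scaled cohesion plus a Gaussian-shaped log prior on length.
    Assumption: the form of the prior is illustrative only."""
    n = len(words)
    length_prior = -((n - mean_len) ** 2) / (2.0 * sigma ** 2)
    return n * cohesion(words, vocab_size) + length_prior


def segment_tokens(tokens, mean_len, sigma):
    """Dynamic programming over candidate boundaries.
    Returns (boundaries, best_score); each boundary is the exclusive end
    index of a segment, so the last one equals len(tokens)."""
    n = len(tokens)
    vocab_size = len(set(tokens))
    best = [0.0] + [-math.inf] * n   # best[j]: best score for tokens[:j]
    back = [0] * (n + 1)             # back[j]: start of the last segment
    for j in range(1, n + 1):
        for i in range(j):
            s = best[i] + segment_score(tokens[i:j], vocab_size, mean_len, sigma)
            if s > best[j]:
                best[j], back[j] = s, i
    bounds = []
    j = n
    while j > 0:                     # walk the back-pointers to recover cuts
        bounds.append(j)
        j = back[j]
    return bounds[::-1], best[n]


# Toy word stream with two topics and no sentence delimiters.
tokens = "rain storm rain flood storm rain vote poll vote party poll vote".split()
bounds, score = segment_tokens(tokens, mean_len=6, sigma=2.0)
print(bounds)  # [6, 12]
```

Because every token position is a candidate boundary, the search works at word granularity and needs no sentence delimiters. The table best[j] is filled exactly, so the returned segmentation is a global maximizer of this toy objective. The double loop evaluates O(N^2) candidate segments, which keeps the search polynomial in the number of tokens.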

Bibliographic reference. Liu, Zihan / Xie, Lei / Feng, Wei (2010): "Maximum lexical cohesion for fine-grained news story segmentation", In INTERSPEECH-2010, 1301-1304.