8th Annual Conference of the International Speech Communication Association

Antwerp, Belgium
August 27-31, 2007

Modeling the Statistical Behavior of Lexical Chains to Capture Word Cohesiveness for Automatic Story Segmentation

Shing-kai Chan, Lei Xie, Helen Meng

Chinese University of Hong Kong, China

We present a mathematically rigorous framework for modeling the statistical behavior of lexical chains for automatic story segmentation of broadcast news audio. Lexical chains were first proposed in [1] to connect related terms within a story, as an embodiment of lexical cohesion. The vocabulary within a story tends to be cohesive, whereas a change in the vocabulary distribution tends to signal the topic shift that occurs across a story boundary. Previous work focused on the concept and nature of lexical chains but segmented stories using arbitrary thresholds. This work proposes using the lognormal distribution to capture the statistical behavior of lexical chains, together with data-driven parameter selection for lexical chain formation. Experiments on the TDT-2 Mandarin Corpus show that the proposed statistical model improves story segmentation, raising the F1-measure from 0.468 to 0.641.
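
To make the idea of fitting a lognormal distribution to lexical chains concrete, the following is a minimal Python sketch and not the authors' implementation. The abstract does not say which chain statistic is modeled or how chains are formed, so the sketch assumes that a chain links repeated occurrences of the same word, that a chain is broken when the gap between occurrences exceeds a hypothetical `max_gap` parameter, and that the modeled quantity is chain length in sentences. The function names `build_chains`, `fit_lognormal`, and `lognormal_pdf` are illustrative only.

```python
import math
from collections import defaultdict


def build_chains(sentences, max_gap):
    """Link repeated occurrences of a word into chains.

    Assumption: a chain is split whenever two consecutive occurrences
    of the word are more than `max_gap` sentences apart.
    Returns a list of (word, first_sentence, last_sentence) tuples.
    """
    positions = defaultdict(list)
    for idx, words in enumerate(sentences):
        for word in set(words):
            positions[word].append(idx)

    chains = []
    for word, idxs in positions.items():
        start = prev = idxs[0]
        for i in idxs[1:]:
            if i - prev > max_gap:
                chains.append((word, start, prev))
                start = i
            prev = i
        chains.append((word, start, prev))
    return chains


def fit_lognormal(values):
    """Maximum-likelihood lognormal fit: mean and std of log(values)."""
    logs = [math.log(v) for v in values if v > 0]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in logs) / len(logs))
    return mu, sigma


def lognormal_pdf(x, mu, sigma):
    """Lognormal density at x > 0 (requires sigma > 0)."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))


# Toy transcript: each inner list is one sentence's words.
sentences = [["a", "b"], ["a"], ["c"], ["c"], ["c"], ["a"]]
chains = build_chains(sentences, max_gap=2)

# Assumed statistic: chain length in sentences (always >= 1).
lengths = [end - start + 1 for _, start, end in chains]
mu, sigma = fit_lognormal(lengths)
```

In this sketch, `max_gap` stands in for the kind of chain-formation parameter the abstract says is selected in a data-driven way; how the fitted density is then turned into boundary decisions is not described in the abstract and is left out here.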


Bibliographic reference. Chan, Shing-kai / Xie, Lei / Meng, Helen (2007): "Modeling the statistical behavior of lexical chains to capture word cohesiveness for automatic story segmentation", in INTERSPEECH-2007, 2581-2584.