12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

Latent Topic Modeling for Audio Corpus Summarization

Timothy J. Hazen

MIT Lincoln Laboratory, USA

This work presents techniques for automatically summarizing the topical content of an audio corpus. Probabilistic latent semantic analysis (PLSA) is used to learn a set of latent topics in an unsupervised fashion. These latent topics are ranked by their relative importance in the corpus, and a summary of each topic is generated from signature words that aptly describe its content. The paper describes techniques for producing a high-quality summarization, then presents and evaluates an example summarization of conversational data from the Fisher corpus that demonstrates the effectiveness of our approach.
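The pipeline outlined in the abstract (learn latent topics with PLSA, rank them, describe each with signature words) can be illustrated with a minimal sketch. The abstract does not state the paper's ranking or word-scoring functions, so the choices here are assumptions: topics are ranked by their corpus-level prior P(z), and signature words are scored by P(w|z) log(P(w|z)/P(w)). The vocabulary, counts, and all function names are hypothetical toy values, and text counts stand in for whatever word estimates a speech recognizer would supply for audio.

```python
import math
import random


def normalize(values):
    """Scale positive values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def plsa(counts, num_topics, iters=100, seed=0):
    """Fit P(w|z) and P(z|d) by EM on a document-by-word count matrix."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(num_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
             for _ in range(n_docs)]
    for _ in range(iters):
        # A tiny floor keeps every probability strictly positive.
        acc_w_z = [[1e-9] * n_words for _ in range(num_topics)]
        acc_z_d = [[1e-9] * num_topics for _ in range(n_docs)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(w|z) * P(z|d).
                post = normalize([p_w_z[z][w] * p_z_d[d][z]
                                  for z in range(num_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(num_topics):
                    acc_w_z[z][w] += counts[d][w] * post[z]
                    acc_z_d[d][z] += counts[d][w] * post[z]
        # M-step: renormalize the expected counts.
        p_w_z = [normalize(row) for row in acc_w_z]
        p_z_d = [normalize(row) for row in acc_z_d]
    return p_w_z, p_z_d


def topic_priors(counts, p_z_d):
    """P(z) = sum_d P(z|d) P(d), with P(d) proportional to document length."""
    p_d = normalize([sum(row) for row in counts])
    return [sum(p_z_d[d][z] * p_d[d] for d in range(len(counts)))
            for z in range(len(p_z_d[0]))]


def signature_words(p_w_z, p_z, vocab, z, k=3):
    """Top-k words for topic z under the assumed score P(w|z) log(P(w|z)/P(w))."""
    scores = []
    for w, word in enumerate(vocab):
        p_w = sum(p_w_z[t][w] * p_z[t] for t in range(len(p_z)))
        scores.append((p_w_z[z][w] * math.log(p_w_z[z][w] / p_w), word))
    return [word for _, word in sorted(scores, reverse=True)[:k]]


def summarize(counts, vocab, num_topics, k=3):
    """Return (P(z), signature words) pairs, most important topic first."""
    p_w_z, p_z_d = plsa(counts, num_topics)
    p_z = topic_priors(counts, p_z_d)
    ranked = sorted(range(num_topics), key=lambda z: p_z[z], reverse=True)
    return [(p_z[z], signature_words(p_w_z, p_z, vocab, z, k)) for z in ranked]


# Hypothetical toy corpus: rows are documents, columns follow `vocab`.
vocab = ["dog", "cat", "pet", "tax", "vote", "law"]
counts = [[3, 2, 1, 0, 0, 0], [2, 3, 1, 0, 0, 0], [2, 2, 2, 0, 0, 0],
          [0, 0, 0, 3, 2, 1], [0, 0, 0, 1, 2, 3]]
summary = summarize(counts, vocab, num_topics=2)
for weight, words in summary:
    print(f"{weight:.2f}", " ".join(words))
```

EM for PLSA converges only to a local optimum, so the topics found depend on the random seed; a practical system would likely compare several initializations.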


Bibliographic reference. Hazen, Timothy J. (2011): "Latent topic modeling for audio corpus summarization", In INTERSPEECH-2011, 913-916.