September 22-25, 1997
We describe a new algorithm for topic classification that can discriminate among thousands of topics. Each story is modeled as a mixture of topics, which explicitly captures the facts that a story has multiple topics, that different words are related to different topics, and that most words are not related to any topic. The resulting model, trained by EM, has sharper word distributions, which yield more accurate topic classification. We tested the algorithm on transcribed broadcast news texts. When trained on one year of stories containing over 5,000 different topics and tested on new (later) stories, the first-choice topic was among the manually annotated topics 76% of the time.
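The abstract does not give the model's equations, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes that each word in a story is generated by one of the story's annotated topics or by a catch-all general-language topic, with uniform mixing weights, and that EM re-estimates the per-topic word distributions. The names `train_topic_mixture` and `GENERAL`, the smoothing constant, and the uniform weights are all hypothetical choices made for illustration.

```python
from collections import defaultdict

# Hypothetical catch-all topic that absorbs words unrelated to any annotated topic.
GENERAL = "<general>"


def train_topic_mixture(stories, n_iters=20, smoothing=1e-3):
    """Estimate p(word | topic) by EM.

    stories: list of (words, topics) pairs, where words is a list of tokens
    and topics is the list of labels annotated for that story.
    Returns a dict mapping topic -> {word: probability}.
    """
    vocab = sorted({w for words, _ in stories for w in words})
    topics = sorted({t for _, ts in stories for t in ts} | {GENERAL})

    # Start every topic from a uniform word distribution.
    p = {t: {w: 1.0 / len(vocab) for w in vocab} for t in topics}

    for _ in range(n_iters):
        counts = {t: defaultdict(float) for t in topics}

        # E-step: split each word occurrence among the story's candidate
        # topics in proportion to p(word | topic). Uniform mixing weights
        # are an assumption of this sketch.
        for words, story_topics in stories:
            candidates = list(story_topics) + [GENERAL]
            for w in words:
                weights = [p[t][w] for t in candidates]
                total = sum(weights)
                for t, weight in zip(candidates, weights):
                    counts[t][w] += weight / total

        # M-step: renormalize the fractional counts, with light additive
        # smoothing so that no word probability becomes exactly zero.
        for t in topics:
            denom = sum(counts[t].values()) + smoothing * len(vocab)
            p[t] = {w: (counts[t][w] + smoothing) / denom for w in vocab}

    return p
```

Under these assumptions, words that occur in stories of every topic drift toward the general-language topic over the iterations, while topic-specific words concentrate in their own topic. This is one way to read the "sharper distributions" the abstract mentions.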
Bibliographic reference. Schwartz, Richard / Imai, Toru / Kubala, Francis / Nguyen, Long / Makhoul, John (1997): "A maximum likelihood model for topic classification of broadcast news", In EUROSPEECH-1997, 1455-1458.