7th International Conference on Spoken Language Processing
September 16-20, 2002
A natural dialogue system for human-computer interaction includes an understanding module that defines groups of semantically similar words and phrases. New domains usually lack large annotated corpora, so it is useful to develop methods for automatically inducing semantic groups (concepts). Classes can be induced from unannotated corpora by means of a context-dependent similarity measure, such as the Kullback-Leibler distance. However, the precision of auto-induced classes is reduced when statistics are poor, or when words of different parts of speech occur in similar lexical contexts. We address this issue by augmenting a semantic generalizer with three new techniques: a part-of-speech (POS) tagger that preprocesses the list of candidate word pairs, trigram contexts in place of bigram contexts, and context thresholding. The subjective quality of the auto-induced classes is compared across these three methodologies on a large newspaper-text corpus (the Wall Street Journal, WSJ). We show that context thresholding has the biggest impact on inducing higher-quality classes. The best results were obtained with a context threshold of three extant bigrams and trigrams. For bigram contexts with POS tags, the precision was 88% for the first 50 clusters, 75% for the first 100 clusters, and 65% for the first 150 clusters. Similar results were attained for trigram contexts without POS tags.
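To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of comparing two words by a symmetrized Kullback-Leibler distance over their bigram-context distributions, with a simple context threshold that discards contexts seen fewer than a given number of times. The function names, the smoothing constant, and the exact thresholding rule are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter, defaultdict
import math

def context_counts(tokens):
    """Map each word to a Counter over its bigram contexts
    (here, simply the word that follows it)."""
    ctx = defaultdict(Counter)
    for word, nxt in zip(tokens, tokens[1:]):
        ctx[word][nxt] += 1
    return ctx

def kl_distance(ctx, w1, w2, threshold=3, eps=1e-6):
    """Symmetrized KL distance between the smoothed context
    distributions of w1 and w2, restricted to contexts seen at
    least `threshold` times for either word (context thresholding).
    Returns infinity if no context survives the threshold."""
    c1, c2 = ctx[w1], ctx[w2]
    support = {c for c in set(c1) | set(c2)
               if c1[c] >= threshold or c2[c] >= threshold}
    if not support:
        return float("inf")
    # Add-eps smoothing so the KL terms are always finite.
    n1 = sum(c1[c] for c in support) + eps * len(support)
    n2 = sum(c2[c] for c in support) + eps * len(support)
    dist = 0.0
    for c in support:
        p = (c1[c] + eps) / n1
        q = (c2[c] + eps) / n2
        dist += p * math.log(p / q) + q * math.log(q / p)
    return dist
```

With a toy corpus such as "the cat sat on the mat the dog sat on the rug", "cat" and "dog" share the context "sat" and so receive a near-zero distance, while raising the threshold above their context counts makes them incomparable (infinite distance), mirroring how thresholding trades coverage for reliability of the statistics.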
Bibliographic reference. Pargellis, Andrew / Fosler-Lussier, Eric / Tsai, Augustine (2002): "Using part-of-speech tags, context thresholding, and trigram contexts to improve the auto-induction of semantic classes", in ICSLP-2002, 605-608.