8th International Conference on Spoken Language Processing

Jeju Island, Korea
October 4-8, 2004

Using Machine Learning to Cope with Imbalanced Classes in Natural Speech: Evidence from Sentence Boundary and Disfluency Detection

Yang Liu (1,3), Elizabeth Shriberg (1,2), Andreas Stolcke (1,2), Mary Harper (3)

(1) International Computer Science Institute, USA
(2) SRI International, USA
(3) Purdue University, USA

We investigate machine learning techniques for coping with highly skewed class distributions in two spontaneous speech processing tasks. Both tasks, sentence boundary detection and disfluency detection, provide important structural information for downstream language processing modules. We examine the effects of data set size, task, sampling method (no sampling, downsampling, oversampling, and ensemble sampling), and learning method (bagging, ensemble bagging, and boosting) on a decision tree prosody model.
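As a rough illustration of two of the sampling methods named above, the following minimal sketch balances a skewed two-class data set by downsampling and by oversampling. It is not the authors' implementation: the function names, the toy feature values, and the labels `boundary` / `no_boundary` are hypothetical, and ensemble sampling and the learning methods (bagging, ensemble bagging, boosting) are not shown.

```python
import random
from collections import Counter


def group_by_class(examples, labels):
    """Group examples by their class label."""
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def downsample(examples, labels, seed=0):
    """Randomly keep only as many examples per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = group_by_class(examples, labels)
    target = min(len(xs) for xs in by_class.values())
    out = []
    for y, xs in by_class.items():
        # Sample without replacement, so no example is duplicated.
        out.extend((x, y) for x in rng.sample(xs, target))
    rng.shuffle(out)
    return out


def oversample(examples, labels, seed=0):
    """Replicate examples of smaller classes until each matches the largest class."""
    rng = random.Random(seed)
    by_class = group_by_class(examples, labels)
    target = max(len(xs) for xs in by_class.values())
    out = []
    for y, xs in by_class.items():
        out.extend((x, y) for x in xs)  # keep every original example
        # Add randomly chosen duplicates (sampling with replacement).
        out.extend((rng.choice(xs), y) for _ in range(target - len(xs)))
    rng.shuffle(out)
    return out


# Hypothetical toy data: 90 non-boundary points and 10 boundary points.
features = list(range(100))
labels = ["no_boundary"] * 90 + ["boundary"] * 10

down = downsample(features, labels)
over = oversample(features, labels)
down_counts = Counter(y for _, y in down)
over_counts = Counter(y for _, y in over)
```

In this toy setting, downsampling discards majority-class examples to reach a balanced set, while oversampling keeps all of the original data and duplicates minority-class examples. Either balanced set could then be passed to a classifier such as a decision tree.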


Bibliographic reference. Liu, Yang / Shriberg, Elizabeth / Stolcke, Andreas / Harper, Mary (2004): "Using machine learning to cope with imbalanced classes in natural speech: evidence from sentence boundary and disfluency detection", in INTERSPEECH-2004, 1525-1528.