INTERSPEECH 2012
13th Annual Conference of the International Speech Communication Association

Portland, OR, USA
September 9-13, 2012

Boosting Classification Based Speech Separation Using Temporal Dynamics

Yuxuan Wang (1), DeLiang Wang (1,2)

(1) Department of Computer Science and Engineering; (2) Center for Cognitive Science;
The Ohio State University, USA

Significant advances in speech separation have been made by formulating it as a classification problem, where the desired output is the ideal binary mask (IBM). Previous work uses standard binary classifiers and does not explicitly model the correlation between neighboring time-frequency units. Because temporal dynamics is one of the most important characteristics of the speech signal, the IBM contains highly structured, rather than random, patterns. In this study, we incorporate temporal dynamics into classification by employing structured output learning. In particular, we use linear-chain structured perceptrons to account for the interactions of neighboring labels in time. However, the performance of structured perceptrons depends largely on the linear separability of the features. To address this problem, we employ pretrained deep neural networks to automatically learn effective feature functions for the structured perceptrons. The experiments show that the proposed system significantly outperforms previous IBM estimation systems.
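
As an illustration of the linear-chain structured perceptron mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes one frequency channel treated as a sequence of T frames with binary mask labels, a hypothetical feature matrix of shape (T, D), and a 2x2 table of transition scores between neighboring labels in time. The function names `viterbi` and `train_structured_perceptron` are illustrative. The sketch uses raw features directly, whereas the paper learns the feature functions with pretrained deep neural networks.

```python
import numpy as np


def viterbi(emission, transition):
    """Return the highest-scoring binary label sequence.

    emission:   (T, 2) array, score of assigning label y to frame t
    transition: (2, 2) array, score of moving from label i to label j
    """
    T = emission.shape[0]
    score = emission[0].copy()
    backptr = np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best score ending in label i at t-1, then label j at t
        cand = score[:, None] + transition
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emission[t]
    labels = np.zeros(T, dtype=int)
    labels[-1] = int(score.argmax())
    for t in range(T - 1, 0, -1):
        labels[t - 1] = backptr[t, labels[t]]
    return labels


def train_structured_perceptron(features, labels, epochs=10):
    """Perceptron training on a single sequence (an assumption for brevity).

    features: (T, D) float array; labels: (T,) int array of 0/1.
    """
    T, D = features.shape
    W = np.zeros((2, D))   # per-label emission weights
    A = np.zeros((2, 2))   # transition weights between neighboring labels
    for _ in range(epochs):
        pred = viterbi(features @ W.T, A)
        if np.array_equal(pred, labels):
            break
        # Update: add features of the true sequence, subtract those of the
        # predicted one; identical positions cancel out.
        for t in range(T):
            if pred[t] != labels[t]:
                W[labels[t]] += features[t]
                W[pred[t]] -= features[t]
        for t in range(1, T):
            A[labels[t - 1], labels[t]] += 1
            A[pred[t - 1], pred[t]] -= 1
    return W, A
```

In this sketch the transition table is what lets a frame with weak local evidence inherit the label of its temporal neighbors, which is the role the abstract assigns to structured output learning.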

Index Terms: Monaural speech separation, temporal dynamics, structured perceptron, deep neural networks

Bibliographic reference.  Wang, Yuxuan / Wang, DeLiang (2012): "Boosting classification based speech separation using temporal dynamics", In INTERSPEECH-2012, 1528-1531.