Symposium on Machine Learning in Speech and Language Processing (MLSLP)
Bellevue, WA, USA
This paper reviews a new method for dealing with unexpected harmful variability (noise) in speech encountered while a recognition system is in operation. The fundamental idea is to derive, in the training phase, statistics of the system output on the data on which the system was trained, and then to adaptively modify the system so that the statistics derived during operation are similar. Multiple processing streams are formed by extracting different spectral and temporal modulation components from the speech signal. The information in each stream is used to estimate posterior probabilities of speech sounds (a posteriogram), and these per-stream estimates are fused to derive the final posteriogram. The autocorrelation matrix of a modified final posteriogram is adopted as the measure that summarizes system performance. The initial setup of the fusion module is found by cross-correlating the probability estimates with phoneme labels on the training data. During operation, the matrix derived on the training data serves as the desired target, and the fusion module is modified to optimize system performance. Results on phoneme recognition from noisy speech indicate the effectiveness of the method.
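The monitoring idea in the abstract can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration, not the authors' implementation: the function names are hypothetical, the unspecified "modification" of the final posteriogram is skipped, fusion is taken to be a weighted linear combination of stream posteriograms, the distance between matrices is the Frobenius norm, and adaptation is reduced to choosing among a few candidate weight vectors.

```python
import numpy as np

def autocorrelation_matrix(posteriogram):
    """Average outer product of per-frame posterior vectors (frames x classes)."""
    p = np.asarray(posteriogram, dtype=float)
    return p.T @ p / p.shape[0]

def fuse(streams, weights):
    """Weighted linear fusion of stream posteriograms, renormalized per frame (assumed fusion rule)."""
    w = np.asarray(weights, dtype=float)
    fused = np.tensordot(w, np.stack(streams), axes=1)  # (frames, classes)
    return fused / fused.sum(axis=1, keepdims=True)

def select_weights(streams, target, candidates):
    """Return the candidate weights whose fused autocorrelation matrix is closest to the target."""
    def distance(w):
        return np.linalg.norm(autocorrelation_matrix(fuse(streams, w)) - target)
    return min(candidates, key=distance)

# Toy data: 3 classes, 6 frames. One stream is confident, the other is uninformative.
labels = np.array([0, 1, 2, 0, 1, 2])
clean = np.full((6, 3), 0.05)
clean[np.arange(6), labels] = 0.9      # sharp posteriors, as expected on training-like data
noisy = np.full((6, 3), 1.0 / 3.0)     # flat posteriors, as if the stream were corrupted

# Target statistics, here taken from the confident stream as a stand-in for training data.
target = autocorrelation_matrix(clean)

candidates = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
best = select_weights([clean, noisy], target, candidates)
```

In this toy setting the selection favours the stream whose output statistics match the target, which mirrors the stated goal of keeping operating statistics similar to training statistics. The label-based cross-correlation used for the initial setup of the fusion module is not reproduced here.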
Bibliographic reference. Hermansky, Hynek / Mesgarani, Nima / Thomas, Samuel (2011): "Performance monitoring for robustness in automatic recognition of speech", In MLSLP-2011, 31-34.