A new method is proposed to deal with variable distortions of speech during system operation. First, multiple processing streams are formed by extracting different spectral and temporal modulation components from the speech signal. The information in each stream is used to estimate posterior probabilities of phonemes. Initial values for the weighted integration of these individual estimates are found by normalized cross-correlation of the estimates with the true phoneme labels on the training data. A statistical model of the final estimated posterior probabilities is used to characterize system performance. During operation, the weights in the linear fusion are adapted by particle filtering to optimize this performance measure. Results on phoneme recognition from noisy speech demonstrate the effectiveness of the proposed method.
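The pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two posterior streams are fabricated synthetic data standing in for real spectro-temporal modulation streams, and the particle-filter scoring function is a simple proxy (mean confidence of the fused posteriors) standing in for the paper's statistical performance model. The weight initialization does follow the abstract's recipe of normalized cross-correlation against one-hot training labels.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, S = 5, 200, 2  # phoneme classes, frames, streams

# Synthetic stand-in data (assumption: real streams come from per-stream
# posterior estimators; here we fabricate two posterior streams, the
# second one noisier).
labels = rng.integers(0, K, size=T)
onehot = np.eye(K)[labels]

def noisy_posteriors(noise):
    logits = 3.0 * onehot + noise * rng.standard_normal((T, K))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

streams = [noisy_posteriors(1.0), noisy_posteriors(2.0)]

# Initial fusion weights: normalized cross-correlation of each stream's
# posterior estimates with the one-hot phoneme labels, as in the abstract.
def ncc(p, y):
    p0, y0 = p - p.mean(), y - y.mean()
    return (p0 * y0).sum() / np.sqrt((p0 ** 2).sum() * (y0 ** 2).sum())

w = np.clip(np.array([ncc(p, onehot) for p in streams]), 0.0, None)
w /= w.sum()

def fuse(weights, block):
    # Linear fusion: convex combination of per-stream posteriors.
    return sum(wi * pi for wi, pi in zip(weights, block))

# Particle-filter adaptation (sketch): each particle is a candidate weight
# vector; particles drift with Gaussian noise and are resampled according
# to a proxy score for recognition performance on the current block.
N = 100
particles = np.clip(w + 0.05 * rng.standard_normal((N, S)), 1e-6, None)
particles /= particles.sum(axis=1, keepdims=True)

for t0 in range(0, T, 50):  # adapt over successive blocks of frames
    block = [p[t0:t0 + 50] for p in streams]
    scores = np.array([fuse(p, block).max(axis=1).mean() for p in particles])
    idx = rng.choice(N, size=N, p=scores / scores.sum())  # resample
    particles = np.clip(particles[idx]
                        + 0.02 * rng.standard_normal((N, S)), 1e-6, None)
    particles /= particles.sum(axis=1, keepdims=True)

w_adapted = particles.mean(axis=0)
fused = fuse(w_adapted, streams)
acc = (fused.argmax(axis=1) == labels).mean()
print(f"adapted weights: {w_adapted.round(3)}, fused accuracy: {acc:.2f}")
```

Because the particle weights are kept on the probability simplex, the fused output remains a valid posterior distribution at every frame; in a real system the proxy score would be replaced by the likelihood of the fused posteriors under the trained statistical performance model.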
Bibliographic reference. Mesgarani, Nima / Thomas, Samuel / Hermansky, Hynek (2011): "Adaptive stream fusion in multistream recognition of speech", In INTERSPEECH-2011, 2329-2332.