A prototype multi-stream system with a performance monitor for stream selection is proposed to recognize speech in unknown noise. The speech signal is decomposed into seven band-limited streams. Posterior probabilities of phonemes are estimated by a multi-layer perceptron (MLP) in each of these band-limited streams. Estimated posterior vectors of all 127 non-empty combinations of the seven band-limited streams (2^7 - 1 = 127 processing streams) form inputs to a second-stage MLP that estimates posterior probabilities of phonemes in each processing stream. A performance monitor is designed to predict the reliability of individual processing streams based on the outputs from these streams. The N processing streams predicted to be least affected by noise are selected, and their outputs are averaged to yield the final posterior probability vector used in Viterbi search for the best phoneme sequence. Experimental results show that the proposed technique is effective in dealing with noise.
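The stream enumeration and top-N averaging described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the reliability scores, and the use of a plain arithmetic mean over a single posterior vector per stream are assumptions made for illustration; the abstract does not specify how the performance monitor computes its scores or at what granularity selection is applied.

```python
from itertools import combinations

NUM_BANDS = 7  # seven band-limited streams, as stated in the abstract


def enumerate_processing_streams(num_bands=NUM_BANDS):
    """Return every non-empty subset of band indices as a tuple."""
    streams = []
    for size in range(1, num_bands + 1):
        streams.extend(combinations(range(num_bands), size))
    return streams


def select_and_average(posteriors, scores, top_n):
    """Average the posterior vectors of the top_n highest-scoring streams.

    posteriors: one posterior vector (list of floats) per processing stream
    scores: hypothetical reliability scores, higher meaning more reliable;
            a stand-in for the output of the performance monitor
    """
    if not 1 <= top_n <= len(posteriors):
        raise ValueError("top_n must be between 1 and the number of streams")
    # Rank stream indices from most to least reliable.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:top_n]
    dim = len(posteriors[0])
    # Arithmetic mean of the selected posterior vectors (an assumption).
    averaged = [sum(posteriors[i][k] for i in chosen) / top_n for k in range(dim)]
    return chosen, averaged


streams = enumerate_processing_streams()
print(len(streams))  # 127
```

In a full system the averaged vector would be passed to the Viterbi search; here it is simply returned alongside the indices of the selected streams.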
Bibliographic reference. Variani, Ehsan / Li, Feipeng / Hermansky, Hynek (2013): "Multi-stream recognition of noisy speech with performance monitoring", In INTERSPEECH-2013, 2978-2981.