EUROSPEECH 2003 - INTERSPEECH 2003
Unlike automatic speech recognition systems, humans can understand speech when other competing sounds are present. Although the theory of auditory scene analysis (ASA) may help to explain this ability, some perceptual experiments show fusion of the speech signal under circumstances in which ASA principles might be expected to cause segregation. We propose a model of multi-resolution ASA that uses both high- and low-resolution representations of the auditory signal in parallel in order to resolve this conflict. The use of parallel representations reduces variability for pattern-matching while retaining the ability to identify and segregate low-level features of the signal. An important feature of the model is the assumption that features of the auditory signal are fused together unless there is good reason to segregate them. Speech is recognised by matching the low-resolution representation to previously learned speech templates without prior segregation of the signal into separate perceptual streams; this contrasts with the approach generally used by computational models of ASA. We describe an implementation of the multi-resolution model, using hidden Markov models, that illustrates the feasibility of this approach and achieves much higher identification performance than standard techniques used for computer recognition of speech mixed with other sounds.
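The template-matching step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes hypothetical Gaussian HMM word templates with diagonal covariances, pools high-resolution feature frames into a low-resolution view by block averaging, and scores that view against each template with the forward algorithm, without any prior stream segregation. All parameter values and names (`smooth`, `forward_loglik`, the template means) are invented for the example.

```python
import numpy as np

def forward_loglik(obs, log_pi, log_A, means, variances):
    """Log-likelihood of an observation sequence under a Gaussian HMM
    (diagonal covariances), computed with the forward algorithm in log space."""
    T, D = obs.shape
    # Per-frame, per-state Gaussian log-densities: shape (T, N)
    diff = obs[:, None, :] - means[None, :, :]                      # (T, N, D)
    log_b = -0.5 * np.sum(diff ** 2 / variances
                          + np.log(2.0 * np.pi * variances), axis=2)
    alpha = log_pi + log_b[0]                                       # initialise
    for t in range(1, T):
        # alpha_j(t) = b_j(o_t) + logsumexp_i [ alpha_i(t-1) + log A_ij ]
        alpha = log_b[t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)

def smooth(features, width=4):
    """Crude low-resolution view: average non-overlapping blocks of frames."""
    T = (len(features) // width) * width
    return features[:T].reshape(-1, width, features.shape[1]).mean(axis=1)

# Two toy word templates (hypothetical parameters, for illustration only)
log_pi = np.log(np.array([0.9, 0.1]))
log_A = np.log(np.array([[0.8, 0.2],
                         [0.2, 0.8]]))
var = np.ones((2, 2))
means_A = np.array([[0.0, 0.0], [1.0, 1.0]])
means_B = np.array([[5.0, 5.0], [6.0, 6.0]])

frames = np.zeros((16, 2))        # high-resolution features lying near template A
low = smooth(frames)              # pooled low-resolution representation
score_A = forward_loglik(low, log_pi, log_A, means_A, var)
score_B = forward_loglik(low, log_pi, log_A, means_B, var)
best = "A" if score_A > score_B else "B"
```

Because the low-resolution view averages out fine-grained variability before matching, the same scoring loop can be run while a parallel high-resolution analysis remains available for deciding whether any low-level features should be segregated.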
Bibliographic reference. Harding, Sue / Meyer, Georg (2003): "Multi-resolution auditory scene analysis: robust speech recognition using pattern-matching from a noisy signal", In EUROSPEECH-2003, 2109-2112.