Interspeech'2005 - Eurospeech
This paper addresses the problem of recognising speech in the presence of a competing speaker. It uses a two-stage ‘Speech Fragment Decoding’ system. The system first segments a spectro-temporal representation of the mixture into a number of fragments, such that each fragment is dominated by a single source. The ASR search is then extended to find the combination of speech model sequence and fragment subset that best fits a set of clean speech models. The paper extends previous work by combining ‘Speech Fragment Decoding’ with soft missing data techniques to better handle spectro-temporal regions that cannot be confidently ascribed to either foreground or background. Recognition experiments are performed on a connected digit task using 0 dB mixtures of simultaneous mixed-gender speakers. Incorporating soft decisions increases system performance from 66.9% to 72.2%.
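As a rough illustration of the soft missing data idea mentioned in the abstract, the sketch below scores one spectral frame against a single diagonal-covariance Gaussian state. It is not the authors' implementation. The function names, the Gaussian state model, and the bounded-marginal form of the "masked" term are all assumptions made for the example. Each channel carries a soft mask value between 0 and 1, read as the confidence that the channel belongs to the foreground speaker, and the channel likelihood interpolates between a "speech present" term and a "speech masked" term.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gaussian_cdf(x, mean, var):
    """Cumulative distribution of a univariate Gaussian at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def soft_missing_data_likelihood(obs, means, variances, soft_mask):
    """Hypothetical helper: frame likelihood under a soft mask.

    obs, means, variances, soft_mask are equal-length sequences, one entry
    per spectral channel. Assumes non-negative spectral energies, so masked
    speech energy is taken to lie somewhere between 0 and the observed value.
    """
    likelihood = 1.0
    for x, mu, var, w in zip(obs, means, variances, soft_mask):
        # Channel dominated by the foreground: evaluate the density directly.
        present = gaussian_pdf(x, mu, var)
        # Channel dominated by the background: average the density over [0, x].
        masked = (gaussian_cdf(x, mu, var) - gaussian_cdf(0.0, mu, var)) / max(x, 1e-10)
        # Soft decision: weight the two interpretations by the mask value.
        likelihood *= w * present + (1.0 - w) * masked
    return likelihood
```

With every mask value set to 1 this reduces to an ordinary Gaussian likelihood, and with every value set to 0 it reduces to pure bounded marginalisation. Intermediate values cover regions that cannot be confidently assigned to either source.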
Bibliographic reference. Coy, André / Barker, Jon (2005): "Soft harmonic masks for recognising speech in the presence of a competing speaker", In INTERSPEECH-2005, 2641-2644.