Spectrotemporal representations of speech have already shown promising results in speech processing technologies; however, inherent issues of such representations, such as high dimensionality, have limited their use in speech and speaker recognition. A multistream framework fits such representations well: different regions can be separately mapped into posterior probabilities of classes before merging. In this study, we investigated effective ways of forming streams from this representation for robust phoneme recognition. We also investigated multiple ways of fusing the posteriors of different streams based on their individual confidence or on interactions between them. We observed a relative improvement of 8.6% in clean conditions and 4% in noise. We developed a simple yet effective linear combination technique that provides an intuitive understanding of stream combination and shows how even systematic errors can be learnt to reduce confusions.
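The linear combination of stream posteriors mentioned above can be sketched as a weighted average of per-stream class posteriors, renormalized to a distribution. This is a minimal illustrative sketch, not the paper's actual trained combiner; the function name and the assumption of fixed per-stream weights are hypothetical.

```python
import numpy as np

def combine_streams(posteriors, weights):
    """Linearly combine per-stream class posteriors.

    posteriors: array of shape (n_streams, n_classes), each row a
                posterior distribution from one stream
    weights:    array of shape (n_streams,), non-negative stream weights
                (e.g. reflecting per-stream confidence)
    """
    combined = weights @ posteriors      # weighted sum over streams
    return combined / combined.sum()     # renormalize to sum to 1

# Example: two streams over three phoneme classes, the first
# stream weighted more heavily (weights are illustrative).
p = combine_streams(
    np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3]]),
    np.array([0.7, 0.3]),
)
```

In practice such weights could be set from per-stream confidence measures, or the linear map could be learnt from data so that systematic stream errors are exploited rather than merely averaged out.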
Bibliographic reference. Mesgarani, Nima / Thomas, Samuel / Hermansky, Hynek (2010): "A multistream multiresolution framework for phoneme recognition", In INTERSPEECH-2010, 318-321.