Listeners can reliably identify speech in noisy conditions, but the specific features of the speech they use to do so are generally unknown. We employ a recently introduced data-driven framework to identify these features. By analyzing listening-test results for the same speech utterance mixed with many different noise instances, the framework computes the importance of each time-frequency point in the utterance to its intelligibility. This paper shows that a model trained under this framework can generalize to new conditions, successfully predicting the intelligibility of novel mixtures. First, it can generalize to novel noise instances after being trained on mixtures involving the same speech utterance but different noises. Second, it can generalize to novel talkers after being trained on mixtures involving the same syllables produced by different talkers in different noises. Finally, it can generalize to novel phonemes after being trained on mixtures involving different consonants produced by the same or different talkers in different noises. Aligning the clean utterances in time, and then propagating this alignment to the features used for intelligibility prediction, further improves this generalization performance.
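The core analysis described above, relating listener responses across many noise instances to per-point importance, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual method: the synthetic data, the matrix dimensions, and the plain logistic-regression formulation are all assumptions introduced here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n_mix mixtures of one utterance with different noise
# instances, each summarized by noise levels at n_tf time-frequency points.
n_mix, n_tf = 200, 50
X = rng.normal(size=(n_mix, n_tf))   # stand-in per-mixture TF noise levels

# Assume (for this sketch) that only the first 5 TF points actually affect
# whether the listener identifies the utterance correctly.
true_w = np.zeros(n_tf)
true_w[:5] = -2.0
p_correct = 1.0 / (1.0 + np.exp(-(X @ true_w)))
y = (rng.random(n_mix) < p_correct).astype(float)  # 1 = correct response

# Fit logistic regression by gradient descent with a small ridge penalty;
# the fitted weight magnitudes serve as a time-frequency importance map.
w = np.zeros(n_tf)
lr = 0.1
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= lr * (X.T @ (p - y) / n_mix + 1e-3 * w)

importance = np.abs(w)
top = np.argsort(importance)[::-1][:5]  # TF points most critical to intelligibility
print(sorted(top.tolist()))
```

Under this toy setup, the largest-magnitude weights should concentrate on the time-frequency points that actually drove the simulated listener responses; generalization tests like those in the paper would then evaluate such a map on mixtures with unseen noises, talkers, or phonemes.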
Bibliographic reference. Mandel, Michael I. / Yoho, Sarah E. / Healy, Eric W. (2014): "Generalizing time-frequency importance functions across noises, talkers, and phonemes", in Proceedings of INTERSPEECH 2014, pp. 2016–2020.