In this paper we investigate environment feature representations, which we refer to as e-vectors, that can be used for environment adaptation in automatic speech recognition (ASR), and for environment identification. Inspired by the fact that i-vectors in the total variability space capture both speaker and channel environment variability, our proposed e-vectors are extracted from i-vectors. Two extraction methods are proposed: one is via linear discriminant analysis (LDA) projection, and the other via a bottleneck deep neural network (BN-DNN). Our evaluations show that by augmenting DNN-HMM ASR systems with the proposed e-vectors for environment adaptation, ASR performance is significantly improved. We also demonstrate that the proposed e-vector yields promising results on environment identification.
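As a rough illustration of the LDA-based extraction mentioned in the abstract, the sketch below fits an LDA projection on environment-labelled vectors and applies it to obtain lower-dimensional e-vectors. It is a minimal sketch under stated assumptions, not the authors' implementation: the input "i-vectors" are synthetic random data, the dimensions and number of environments are arbitrary, the ridge term is an added numerical safeguard, and the function names `fit_lda_projection` and `extract_evectors` are hypothetical. The BN-DNN variant is not shown.

```python
import numpy as np


def fit_lda_projection(ivectors, env_labels, n_components):
    """Fit an LDA projection; returns (projection matrix, global mean)."""
    classes = np.unique(env_labels)
    dim = ivectors.shape[1]
    global_mean = ivectors.mean(axis=0)
    s_within = np.zeros((dim, dim))
    s_between = np.zeros((dim, dim))
    for c in classes:
        x_c = ivectors[env_labels == c]
        mean_c = x_c.mean(axis=0)
        centered = x_c - mean_c
        # Within-class scatter: spread of vectors around their environment mean.
        s_within += centered.T @ centered
        # Between-class scatter: spread of environment means around the global mean.
        diff = (mean_c - global_mean)[:, None]
        s_between += x_c.shape[0] * (diff @ diff.T)
    # Assumption: a small ridge term keeps the within-class scatter invertible.
    s_within += 1e-6 * np.eye(dim)
    # Solve S_w^{-1} S_b v = lambda v and keep the leading eigenvectors.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(s_within, s_between))
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvecs[:, order].real, global_mean


def extract_evectors(ivectors, projection, global_mean):
    """Project mean-centered i-vectors onto the LDA directions."""
    return (ivectors - global_mean) @ projection


# Synthetic stand-in for i-vectors (assumption): 3 environments, 50 vectors each.
rng = np.random.default_rng(0)
n_envs, per_env, dim = 3, 50, 10
centers = rng.normal(scale=3.0, size=(n_envs, dim))
ivectors = np.vstack([c + rng.normal(size=(per_env, dim)) for c in centers])
env_labels = np.repeat(np.arange(n_envs), per_env)

# LDA yields at most (number of classes - 1) discriminative directions.
projection, global_mean = fit_lda_projection(ivectors, env_labels, n_envs - 1)
evectors = extract_evectors(ivectors, projection, global_mean)
print(evectors.shape)  # (150, 2)
```

In an adaptation setting like the one the abstract describes, vectors such as `evectors` would be appended to the acoustic features fed to the DNN; how that augmentation is done in the paper is not specified here.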
Cite as: Feng, X., Richardson, B., Amman, S., Glass, J. (2017) An Environmental Feature Representation for Robust Speech Recognition and for Environment Identification. Proc. Interspeech 2017, 3078-3082, doi: 10.21437/Interspeech.2017-485
@inproceedings{feng17b_interspeech,
  author={Xue Feng and Brigitte Richardson and Scott Amman and James Glass},
  title={{An Environmental Feature Representation for Robust Speech Recognition and for Environment Identification}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3078--3082},
  doi={10.21437/Interspeech.2017-485}
}