This paper addresses the problem of time-varying channels in speech-recognition-based human-robot interaction using Locally-Normalized Filter-Bank (LNFB) features and training strategies that compensate for microphone response and room acoustics. Testing utterances were generated by re-recording the Aurora-4 testing database with a PR2 mobile robot equipped with a Kinect audio interface, while the robot performed head rotations and movements toward and away from a fixed source. Three training conditions were evaluated: Clean, 1-IR, and 33-IR. With Clean training, the DNN-HMM system was trained using the Aurora-4 clean training database. With 1-IR training, the same training data were convolved with an impulse response estimated at one meter from the source with no rotation of the robot head. With 33-IR training, the Aurora-4 training data were convolved with impulse responses estimated at one, two, and three meters from the source and at 11 angular positions of the robot head. The 33-IR training method reduced word error rate (WER) by more than 50% compared with Clean training, with both LNFB and conventional Mel filterbank (MelFB) features. Nevertheless, with 33-IR training, LNFB features provided a WER 23% lower than MelFB features. The combination of 33-IR training and LNFB features reduced WER by 64% compared with Clean training and MelFB features.
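The 1-IR and 33-IR conditions both rest on convolving clean training speech with measured impulse responses. The following is a minimal sketch of that idea, not the authors' pipeline: the function names, the synthetic decaying-noise impulse responses, the peak normalization, and the use of angle indices 0 to 10 in place of real head angles are all assumptions made for illustration. In practice the impulse responses would be the ones estimated with the robot.

```python
import numpy as np


def simulate_channel(speech, impulse_response):
    """Convolve a clean utterance with an impulse response (hypothetical helper).

    The output is truncated to the input length and peak-normalized;
    both choices are assumptions, not taken from the paper.
    """
    speech = np.asarray(speech, dtype=float)
    ir = np.asarray(impulse_response, dtype=float)
    out = np.convolve(speech, ir, mode="full")[: len(speech)]
    peak = np.max(np.abs(out))
    return out / peak if peak > 0 else out


def build_ir_bank(distances_m=(1, 2, 3), n_angles=11, ir_length=256, seed=0):
    """Build a placeholder bank with one IR per (distance, angle index) pair.

    Synthetic exponentially decaying noise stands in for measured IRs.
    """
    rng = np.random.default_rng(seed)
    taps = np.arange(ir_length)
    bank = {}
    for d in distances_m:
        for a in range(n_angles):
            ir = rng.standard_normal(ir_length) * np.exp(-taps / (40.0 * d))
            ir[0] = 1.0  # direct-path tap
            bank[(d, a)] = ir
    return bank


# 3 distances x 11 head positions = 33 channel conditions.
bank = build_ir_bank()
clean = np.random.default_rng(1).standard_normal(16000)  # stand-in for 1 s at 16 kHz
augmented = {key: simulate_channel(clean, ir) for key, ir in bank.items()}
```

A 1-IR style condition corresponds to using a single entry of the bank for all training utterances, whereas a 33-IR style condition spreads the training data across all 33 entries.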
Cite as: Novoa, J., Wuth, J., Escudero, J.P., Fredes, J., Mahu, R., Stern, R.M., Yoma, N.B. (2017) Robustness Over Time-Varying Channels in DNN-HMM ASR Based Human-Robot Interaction. Proc. Interspeech 2017, 839-843, doi: 10.21437/Interspeech.2017-1308
@inproceedings{novoa17_interspeech,
  author={José Novoa and Jorge Wuth and Juan Pablo Escudero and Josué Fredes and Rodrigo Mahu and Richard M. Stern and Nestor Becerra Yoma},
  title={{Robustness Over Time-Varying Channels in DNN-HMM ASR Based Human-Robot Interaction}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={839--843},
  doi={10.21437/Interspeech.2017-1308}
}