Synthesis of Device-Independent Noise Corpora for Realistic ASR Evaluation

Hannes Gamper, Mark R.P. Thomas, Lyle Corbin, Ivan Tashev

To effectively evaluate the accuracy of automatic speech recognition (ASR) with a novel capture device, it is important to create a realistic test data corpus that is representative of real-world noise conditions. Typically, this involves either recording the output of a device under test (DUT) in a noisy environment, or synthesizing an environment over loudspeakers in a way that simulates realistic signal-to-noise ratios (SNRs), reverberation times, and spatial noise distributions. Here we propose a method that aims to combine the realism of in-situ recordings with the convenience and repeatability of synthetic corpora. A device-independent spatial recording containing noise and speech is combined with the measured directivity pattern of a DUT to generate a synthetic test corpus for evaluating the performance of an ASR system. This is achieved by a spherical harmonic decomposition of both the sound field and the DUT’s directivity patterns. Experimental results suggest that the proposed method can be a viable alternative to costly and cumbersome device-dependent measurements. The proposed simulation method predicted the SNR of the DUT response to within about 3 dB and the word error rate (WER) to within about 20%, across a range of test SNRs, target source directions, and noise types.
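The core idea of combining a spherical harmonic (SH) description of the sound field with an SH description of the DUT directivity can be illustrated with a small numerical sketch. This is not the authors' implementation: it assumes a single frequency-independent channel, real orthonormal SH truncated at order 1, an idealized plane-wave sound field, and a hypothetical cardioid DUT pointing along +x. All function names are made up for illustration. Under these assumptions the simulated DUT output is the dot product of the two coefficient vectors.

```python
import numpy as np

# Normalization constants of the real orthonormal SH for orders 0 and 1.
C0 = 1.0 / np.sqrt(4.0 * np.pi)
C1 = np.sqrt(3.0 / (4.0 * np.pi))


def real_sh_order1(azimuth, elevation):
    """Real orthonormal SH up to order 1, ordered [Y00, Y1-1, Y10, Y11]."""
    x = np.cos(elevation) * np.cos(azimuth)
    y = np.cos(elevation) * np.sin(azimuth)
    z = np.sin(elevation)
    return np.array([C0, C1 * y, C1 * z, C1 * x])


def plane_wave_coeffs(amplitude, azimuth, elevation):
    """SH coefficients of a unit plane wave (assumed sound field model)."""
    return amplitude * real_sh_order1(azimuth, elevation)


def cardioid_coeffs():
    """SH coefficients of a hypothetical cardioid D = 0.5 + 0.5 * x."""
    return np.array([0.5 / C0, 0.0, 0.0, 0.5 / C1])


def simulate_dut_output(field_coeffs, directivity_coeffs):
    """Device output as the inner product of field and directivity coefficients."""
    return float(np.dot(field_coeffs, directivity_coeffs))


def snr_db(speech_output, noise_output):
    """Amplitude ratio of two simulated outputs, expressed in dB."""
    return 20.0 * np.log10(abs(speech_output) / abs(noise_output))


dut = cardioid_coeffs()
speech = simulate_dut_output(plane_wave_coeffs(1.0, 0.0, 0.0), dut)        # from the front
noise = simulate_dut_output(plane_wave_coeffs(1.0, np.pi / 2.0, 0.0), dut)  # from the side
print(speech, noise, snr_db(speech, noise))
```

With equal-amplitude speech from the front and noise from the side, the cardioid passes the speech at full gain and the noise at half gain, so the simulated SNR is about 6 dB. A measured directivity pattern and a recorded sound field would require higher SH orders and per-frequency coefficients, which this sketch deliberately omits.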

DOI: 10.21437/Interspeech.2016-978

Cite as

Gamper, H., Thomas, M.R.P., Corbin, L., Tashev, I. (2016) Synthesis of Device-Independent Noise Corpora for Realistic ASR Evaluation. Proc. Interspeech 2016, 2791-2795.

author={Hannes Gamper and Mark R.P. Thomas and Lyle Corbin and Ivan Tashev},
title={Synthesis of Device-Independent Noise Corpora for Realistic ASR Evaluation},
booktitle={Interspeech 2016},