Interspeech'2005 - Eurospeech

Lisbon, Portugal
September 4-8, 2005

Non-Linear Estimation of Voice Activity to Improve Automatic Recognition of Noisy Speech

Roberto Gemello (1), Franco Mana (1), Renato de Mori (2)

(1) Loquendo, Italy; (2) LIA-CNRS, Avignon, France

Feed-forward multi-layer perceptrons (MLPs) and recurrent neural networks (RNNs), fed with different sets of acoustic features, are proposed for estimating the presence or absence of speech in a continuous speech signal in the presence of various levels of background noise. Detailed performance evaluations of voice activity detection (VAD) are reported on the Aurora2, Aurora3 and TIMIT corpora. The best results are obtained with an RNN fed with the acoustic features used for automatic speech recognition (ASR), augmented by specific features. Detailed ASR evaluations are also reported on Aurora2 and on the German, Italian and Spanish portions of the Aurora3 test set. The highest word error rate (WER) reduction (16.9%) is obtained when the noise-only presence probability is used to modify the phone posterior probabilities used for speech decoding.
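
The abstract does not state how the noise presence probability modifies the phone posteriors, so the following is only an illustrative sketch under an assumed rule, not the authors' method. It assumes that, for each frame, the posterior mass is interpolated between the original phone posteriors and a dedicated silence/noise class, weighted by the noise-only probability. The function name `reweight_posteriors`, the `silence_index` parameter and the interpolation rule itself are all hypothetical.

```python
from typing import List


def reweight_posteriors(
    posteriors: List[List[float]],
    p_noise: List[float],
    silence_index: int,
) -> List[List[float]]:
    """Shift posterior mass toward the silence class as p_noise grows.

    ASSUMED rule (not taken from the paper): for each frame,
        new = (1 - p_noise) * old + p_noise * one_hot(silence_index)
    which keeps every frame normalized if the input frame was.
    """
    if len(posteriors) != len(p_noise):
        raise ValueError("need one noise probability per frame")

    adjusted = []
    for frame, p in zip(posteriors, p_noise):
        if not 0.0 <= p <= 1.0:
            raise ValueError("noise probability must lie in [0, 1]")
        # Scale every phone posterior by the probability that speech is present.
        new_frame = [(1.0 - p) * q for q in frame]
        # Give the remaining mass to the (hypothetical) silence/noise class.
        new_frame[silence_index] += p
        adjusted.append(new_frame)
    return adjusted


# Two frames, three classes; class 0 plays the role of silence/noise.
posteriors = [[0.1, 0.6, 0.3], [0.2, 0.5, 0.3]]
p_noise = [0.0, 0.5]
adjusted = reweight_posteriors(posteriors, p_noise, silence_index=0)
```

With a noise probability of 0 a frame is left unchanged, and with a probability of 1 all mass moves to the silence class. In a hybrid decoder, the adjusted posteriors would then take the place of the original ones when computing acoustic scores.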


Bibliographic reference. Gemello, Roberto / Mana, Franco / Mori, Renato de (2005): "Non-linear estimation of voice activity to improve automatic recognition of noisy speech", in INTERSPEECH-2005, 2617-2620.