Convolutional Recurrent Neural Networks for Speech Activity Detection in Naturalistic Audio from Apollo Missions

Pablo Gimeno, Dayana Ribas, Alfonso Ortega, Antonio Miguel, Eduardo Lleida

Speech Activity Detection (SAD) aims to correctly identify the audio segments that contain human speech. Several solutions have been successfully applied to the SAD task, with deep learning approaches being especially relevant nowadays. This paper describes a SAD solution based on Convolutional Recurrent Neural Networks (CRNN), presented as the ViVoLab submission to the 2020 Fearless Steps challenge. The dataset comes from the audio of the Apollo space missions and represents a challenging domain, with strong degradation and several kinds of transmission noise. First, we explore the performance of 1D and 2D convolutional processing stages. Then we propose a novel architecture that fuses two convolutional feature maps, combining the information captured with 1D and 2D filters. The results largely outperform the baseline provided by the organizers: all configurations achieve a detection cost function (DCF) below 2% on the development set. The best results were obtained with the proposed fusion architecture, which reaches a DCF of 1.78% on the evaluation set and ranked fourth among all participating teams in the challenge SAD task.
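To make the reported metric concrete, the following is a minimal sketch of a frame-level detection cost function. The abstract does not define the DCF; the sketch assumes the weighting commonly cited for the Fearless Steps SAD task, 0.75 for the miss rate and 0.25 for the false alarm rate. The function name `detection_cost`, its parameters, and the frame-level simplification (the official scoring works on time durations) are illustrative assumptions, not the authors' scoring code.

```python
from typing import Sequence


def detection_cost(
    reference: Sequence[int],
    hypothesis: Sequence[int],
    miss_weight: float = 0.75,  # assumed weight on missed speech
    fa_weight: float = 0.25,    # assumed weight on false alarms
) -> float:
    """Frame-level DCF sketch: weighted sum of miss and false alarm rates.

    Both inputs are per-frame labels, 1 for speech and 0 for non-speech.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    speech_frames = sum(1 for r in reference if r)
    nonspeech_frames = len(reference) - speech_frames

    # Speech frames labelled as non-speech by the system.
    missed = sum(1 for r, h in zip(reference, hypothesis) if r and not h)
    # Non-speech frames labelled as speech by the system.
    false_alarms = sum(1 for r, h in zip(reference, hypothesis) if not r and h)

    # Guard against empty classes to avoid division by zero.
    p_miss = missed / speech_frames if speech_frames else 0.0
    p_fa = false_alarms / nonspeech_frames if nonspeech_frames else 0.0

    return miss_weight * p_miss + fa_weight * p_fa


if __name__ == "__main__":
    ref = [1, 1, 1, 1, 0, 0, 0, 0]
    hyp = [1, 1, 1, 0, 1, 0, 0, 0]
    print(f"DCF = {100 * detection_cost(ref, hyp):.2f}%")
```

Under these assumptions, a DCF of 1.78% means the weighted combination of missed speech and falsely detected speech amounts to under two percent, with missed speech penalized three times as heavily as false alarms.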

doi: 10.21437/IberSPEECH.2021-6

Gimeno, P, Ribas, D, Ortega, A, Miguel, A, Lleida, E (2021) Convolutional Recurrent Neural Networks for Speech Activity Detection in Naturalistic Audio from Apollo Missions. Proc. IberSPEECH 2021, 26-30, doi: 10.21437/IberSPEECH.2021-6.