ISCA Archive Interspeech 2021

Speech Activity Detection Based on Multilingual Speech Recognition System

Seyyed Saeed Sarfjoo, Srikanth Madikeri, Petr Motlicek

To better model contextual information and improve the generalization ability of a Speech Activity Detection (SAD) system, this paper leverages a multilingual Automatic Speech Recognition (ASR) system to perform SAD. Sequence-discriminative training of the Acoustic Model (AM) with the Lattice-Free Maximum Mutual Information (LF-MMI) loss function effectively extracts the contextual information of the input acoustic frames. Multilingual AM training makes the model robust to noise and language variability. The index of the maximum output posterior is used as a frame-level speech/non-speech decision function. Majority voting and logistic regression are applied to fuse the language-dependent decisions. The multilingual ASR system is trained on 18 languages from the BABEL datasets, and the resulting SAD model is evaluated on 3 different languages. On out-of-domain datasets, the proposed SAD model performs significantly better than the baseline models. On the Ester2 dataset, without using any in-domain data, this model outperforms the WebRTC, phoneme-recognizer-based VAD (Phn_Rec), and Pyannote baselines (by 7.1, 1.7, and 2.7% absolute, respectively) on the Detection Error Rate (DetER) metric. Similarly, on the LiveATC dataset, this model outperforms the WebRTC, Phn_Rec, and Pyannote baselines (by 6.4, 10.0, and 3.7% absolute, respectively) on the DetER metric.
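The frame-level argmax decision and majority-voting fusion described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: which output indices count as non-speech, and the tie-breaking rule in the vote, are assumptions made here for the example.

```python
import numpy as np

def frame_decisions(posteriors, nonspeech_ids):
    """Frame-level speech/non-speech decision from AM output posteriors.

    posteriors: (T, C) array of per-frame output posteriors from one
    language-dependent acoustic model. nonspeech_ids: set of output
    indices treated as non-speech (e.g. silence/noise units; which
    indices these are is an assumption of this sketch).
    Returns a boolean array of length T: True = speech.
    """
    top = np.argmax(posteriors, axis=1)  # index of maximum output posterior
    return np.array([idx not in nonspeech_ids for idx in top])

def majority_vote(decisions):
    """Fuse per-language frame decisions by majority voting.

    decisions: (L, T) boolean array, one row per language-dependent
    system. Ties are counted as speech (an assumption of this sketch).
    """
    votes = decisions.sum(axis=0)
    return votes * 2 >= decisions.shape[0]
```

For example, with three language-dependent systems, `majority_vote` keeps a frame as speech whenever at least half of the systems labeled it speech; the paper's second fusion strategy, logistic regression, would instead learn weights over the per-language decisions.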

doi: 10.21437/Interspeech.2021-1058

Cite as: Sarfjoo, S.S., Madikeri, S., Motlicek, P. (2021) Speech Activity Detection Based on Multilingual Speech Recognition System. Proc. Interspeech 2021, 4369-4373, doi: 10.21437/Interspeech.2021-1058

@inproceedings{sarfjoo21_interspeech,
  author={Seyyed Saeed Sarfjoo and Srikanth Madikeri and Petr Motlicek},
  title={{Speech Activity Detection Based on Multilingual Speech Recognition System}},
  year={2021},
  booktitle={Proc. Interspeech 2021},
  pages={4369--4373},
  doi={10.21437/Interspeech.2021-1058}
}