ISCA Archive Interspeech 2021

Multi-Channel VAD for Transcription of Group Discussion

Osamu Ichikawa, Kaito Nakano, Takahiro Nakayama, Hajime Shirouzu

Attempts are being made to visualize the learning process by attaching microphones to students participating in group work conducted in classrooms and subsequently transcribing their speech using an automatic speech recognition (ASR) system. However, the voices of nearby students frequently become mixed into the recorded speech, even when noise-robust close-talk microphones are used. To resolve this challenge, in this paper, we propose using multi-channel voice activity detection (VAD) to determine the speech segments of a target speaker while also referencing the signals from the microphones attached to the other speakers in the group. Evaluation experiments using actual speech of middle school students during group work lessons showed that the proposed method significantly improves the frame error rate (38.7%) compared with the conventional technology, single-channel VAD (49.5%). In our view, conventional approaches, such as distributed microphone arrays and deep learning, depend to some extent on the temporal stationarity of the speakers' positions. In contrast, the proposed method is essentially a VAD process and thus works robustly. It is a practical, proven solution for a real classroom environment.
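To illustrate the core idea (not the authors' actual algorithm), the sketch below implements a toy cross-channel comparison: a frame is attributed to the target speaker only when the energy on the target's microphone dominates the energy observed on the other group members' microphones. The function name, the `energy_floor` and `margin` parameters, and the example values are all hypothetical.

```python
import numpy as np

def multichannel_vad(energies, energy_floor=1e-3, margin=2.0):
    """Toy multi-channel VAD by cross-channel energy comparison.

    energies: array of shape (n_channels, n_frames) of per-frame
    short-time energies; channel 0 is the target speaker's microphone.
    A frame counts as target speech only if the target channel exceeds
    an absolute floor AND dominates every other channel by `margin`
    (rejecting cross-talk that leaks into the close-talk microphone).
    Returns a boolean array of length n_frames.
    """
    energies = np.asarray(energies, dtype=float)
    target = energies[0]
    if energies.shape[0] > 1:
        others = energies[1:].max(axis=0)
    else:
        others = np.zeros_like(target)
    return (target > energy_floor) & (target > margin * others)

# Hypothetical energies: frame 1 is a nearby speaker's utterance that
# leaks into the target mic, so it is rejected despite nonzero energy.
energies = np.array([
    [0.5, 0.2, 0.01, 0.6],   # target mic
    [0.1, 0.5, 0.02, 0.1],   # neighbouring student's mic
])
print(multichannel_vad(energies).tolist())  # → [True, False, False, True]
```

In a single-channel setting, frame 1 would likely be accepted as speech; referencing the neighbour's channel is what lets the detector reject it.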

doi: 10.21437/Interspeech.2021-200

Cite as: Ichikawa, O., Nakano, K., Nakayama, T., Shirouzu, H. (2021) Multi-Channel VAD for Transcription of Group Discussion. Proc. Interspeech 2021, 336-340, doi: 10.21437/Interspeech.2021-200

@inproceedings{ichikawa21_interspeech,
  author={Osamu Ichikawa and Kaito Nakano and Takahiro Nakayama and Hajime Shirouzu},
  title={{Multi-Channel VAD for Transcription of Group Discussion}},
  booktitle={Proc. Interspeech 2021},
  pages={336--340},
  doi={10.21437/Interspeech.2021-200}
}