ISCA Archive Interspeech 2021

Real-Time Speaker Counting in a Cocktail Party Scenario Using Attention-Guided Convolutional Neural Network

Midia Yousefi, John H.L. Hansen

Most current speech technology systems are designed to operate well even in the presence of multiple active speakers. However, most solutions assume that the number of concurrent speakers is known. Unfortunately, this information might not always be available in real-world applications. In this study, we propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech. The proposed system extracts higher-level information from the speech spectral content using a CNN model. Next, the attention mechanism summarizes the extracted information into a compact feature vector without losing critical information. Finally, the number of active speakers is classified using a fully connected network. Experiments on simulated overlapping speech using the WSJ corpus show that the attention solution improves performance by almost 3% absolute over conventional temporal average pooling. The proposed attention-guided CNN achieves 76.15% for both Weighted Accuracy and average Recall, and 75.80% Precision on speech segments as short as 20 frames (i.e., 200 ms). All the classification metrics exceed 92% for the attention-guided model in offline scenarios where the input signal is more than 100 frames long (i.e., 1 s).
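
The difference between temporal average pooling and attention-based pooling mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the form of the attention mechanism, so the example assumes a generic softmax attention over per-frame scores, and the names `average_pool`, `attention_pool`, `frames`, and `w` are hypothetical. In a trained model the scoring vector `w` would be learned and the frames would be CNN outputs. Here both are toy values.

```python
import math

def average_pool(frames):
    """Temporal average pooling: every frame contributes equally."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def attention_pool(frames, w):
    """Softmax attention pooling (assumed form, not taken from the paper).

    Each frame gets a scalar score (dot product with w); a softmax over
    time turns the scores into weights that sum to 1, and the pooled
    vector is the weighted sum of the frames.
    """
    scores = [sum(x * wi for x, wi in zip(frame, w)) for frame in frames]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(a * x for a, x in zip(weights, col)) for col in zip(*frames)]
    return pooled, weights

# Toy input: 3 frames of 2-dimensional features (stand-ins for CNN outputs).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [2.0, 0.0]  # hypothetical scoring vector; learned in a real model

avg = average_pool(frames)
att, weights = attention_pool(frames, w)
```

With an all-zero scoring vector every frame receives the same weight, so the attention pooling in this sketch reduces to average pooling. A non-zero `w` lets some frames contribute more to the summary vector than others.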


doi: 10.21437/Interspeech.2021-331

Cite as: Yousefi, M., Hansen, J.H.L. (2021) Real-Time Speaker Counting in a Cocktail Party Scenario Using Attention-Guided Convolutional Neural Network. Proc. Interspeech 2021, 1484-1488, doi: 10.21437/Interspeech.2021-331

@inproceedings{yousefi21_interspeech,
  author={Midia Yousefi and John H.L. Hansen},
  title={{Real-Time Speaker Counting in a Cocktail Party Scenario Using Attention-Guided Convolutional Neural Network}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1484--1488},
  doi={10.21437/Interspeech.2021-331}
}