Most current speech technology systems are designed to operate well even in the presence of multiple active speakers. However, most solutions assume that the number of concurrent speakers is known. Unfortunately, this information might not always be available in real-world applications. In this study, we propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech. The proposed system extracts higher-level information from the speech spectral content using a CNN model. Next, the attention mechanism summarizes the extracted information into a compact feature vector without losing critical information. Finally, a fully connected network classifies the number of active speakers. Experiments on simulated overlapping speech using the WSJ corpus show that the attention mechanism improves performance by almost 3% absolute over conventional temporal average pooling. The proposed attention-guided CNN achieves 76.15% for both Weighted Accuracy and average Recall, and 75.80% Precision on speech segments as short as 20 frames (i.e., 200 ms). All the classification metrics exceed 92% for the attention-guided model in offline scenarios where the input signal is more than 100 frames long (i.e., 1 s).
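To illustrate the difference between temporal average pooling and attention-based pooling mentioned in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. The additive (tanh) scoring function, the parameter names `w` and `v`, and the dimensions `D` and `H` are assumptions made for illustration; the abstract does not specify them. Only the 20-frame segment length is taken from the abstract, and random values stand in for CNN outputs and learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def average_pool(features):
    """Temporal average pooling: every frame contributes equally.

    features: array of shape (T, D), one D-dimensional vector per frame.
    """
    return features.mean(axis=0)

def attention_pool(features, w, v):
    """Attention pooling: a weighted sum over frames.

    Assumed scoring function (hypothetical): score_t = v . tanh(features_t @ w)
    features: (T, D); w: (D, H); v: (H,)
    Returns the pooled vector (D,) and the per-frame weights (T,).
    """
    scores = np.tanh(features @ w) @ v    # one scalar score per frame, shape (T,)
    weights = softmax(scores, axis=0)     # non-negative, sums to 1 over frames
    pooled = weights @ features           # weighted sum of frame vectors, shape (D,)
    return pooled, weights

# Stand-in for CNN output on a 20-frame segment; D and H are hypothetical sizes.
rng = np.random.default_rng(0)
T, D, H = 20, 64, 32
features = rng.standard_normal((T, D))
w = rng.standard_normal((D, H)) * 0.1     # would be learned in a real model
v = rng.standard_normal(H) * 0.1          # would be learned in a real model

pooled, weights = attention_pool(features, w, v)
avg = average_pool(features)
```

In this sketch, setting all scores equal (for example `v = 0`) makes the weights uniform, so attention pooling reduces to average pooling. With learned parameters the weights can instead emphasize the frames that are most informative, which is the behavior the abstract attributes to the attention mechanism.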
Cite as: Yousefi, M., Hansen, J.H.L. (2021) Real-Time Speaker Counting in a Cocktail Party Scenario Using Attention-Guided Convolutional Neural Network. Proc. Interspeech 2021, 1484-1488, doi: 10.21437/Interspeech.2021-331
@inproceedings{yousefi21_interspeech,
  author={Midia Yousefi and John H.L. Hansen},
  title={{Real-Time Speaker Counting in a Cocktail Party Scenario Using Attention-Guided Convolutional Neural Network}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1484--1488},
  doi={10.21437/Interspeech.2021-331}
}