This paper describes the systems developed by the DKU-Duke-Lenovo team for the Fearless Steps Challenge Phase III. For the speech activity detection (SAD) task, we employ a U-Net-based model, which has not previously been used for SAD, and obtain a detection cost function (DCF) of 1.915% on the evaluation set. For the speaker identification (SID) task, we adopt the ResNet-SE and ECAPA-TDNN models and obtain a Top-5 accuracy of 86.21%. For the speaker diarization (SD) task, we employ several different clustering methods. In addition, domain adaptation, system fusion, and Target-Speaker Voice Activity Detection (TS-VAD) significantly improve SD performance. We obtain a diarization error rate (DER) of 12.32% on track 2, with the largest contribution coming from our ResNet-based TS-VAD model. Overall, our systems ranked first for SD and SID and second for SAD in the challenge.
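As a reading aid for the metrics quoted above, the following is a minimal sketch of how a Top-k accuracy and a weighted detection cost can be computed. It is not the challenge's official scoring code. The function names `top_k_accuracy` and `detection_cost` are hypothetical, and the default weights of 0.75 on the miss rate and 0.25 on the false-alarm rate are an assumption about the challenge's DCF definition that should be checked against the official evaluation plan.

```python
from typing import Sequence


def top_k_accuracy(scores: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   k: int = 5) -> float:
    """Fraction of utterances whose true speaker index is among the
    k highest-scoring candidate speakers."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one utterance is required")
    hits = 0
    for row, true_idx in zip(scores, labels):
        # Rank speaker indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(labels)


def detection_cost(p_fn: float, p_fp: float,
                   w_fn: float = 0.75, w_fp: float = 0.25) -> float:
    """Weighted sum of the miss rate (p_fn) and false-alarm rate (p_fp).
    The default weights are an assumption, not taken from this abstract."""
    return w_fn * p_fn + w_fp * p_fp


if __name__ == "__main__":
    # Toy scores for three utterances over three speakers (illustrative only).
    toy_scores = [[0.1, 0.9, 0.3], [0.8, 0.15, 0.05], [0.2, 0.3, 0.5]]
    toy_labels = [1, 2, 0]
    print(top_k_accuracy(toy_scores, toy_labels, k=1))
    print(detection_cost(p_fn=0.02, p_fp=0.01))
```

With k equal to the number of candidate speakers, `top_k_accuracy` always returns 1.0, which is a quick sanity check on the ranking logic.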
Cite as: Wang, W., Cai, D., Wang, J., Lin, Q., Wang, X., Hong, M., Li, M. (2021) The DKU-Duke-Lenovo System Description for the Fearless Steps Challenge Phase III. Proc. Interspeech 2021, 1044-1048, doi: 10.21437/Interspeech.2021-235
@inproceedings{wang21i_interspeech,
  author={Weiqing Wang and Danwei Cai and Jin Wang and Qingjian Lin and Xuyang Wang and Mi Hong and Ming Li},
  title={{The DKU-Duke-Lenovo System Description for the Fearless Steps Challenge Phase III}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1044--1048},
  doi={10.21437/Interspeech.2021-235}
}