Survey Talk: End-to-End Deep Neural Network Based Speaker and Language Recognition

Ming Li, Weicheng Cai, Danwei Cai


The speech signal not only contains lexical information but also delivers various kinds of paralinguistic speech attribute information, such as speaker, language, gender, age, and emotion. The core technical question behind these tasks is utterance-level supervised learning on text-independent or text-dependent speech signals of variable duration. In section 1, we first formulate the problem of speaker and language recognition. In section 2, we introduce the traditional framework with its different modules in a pipeline, namely feature extraction, representation, variability compensation, and backend classification. In section 3, we introduce the end-to-end idea and compare it with the traditional framework, showing the correspondence between feature extraction and the CNN layers, between representation and the encoding layer, and between backend modeling and the fully connected layers. We then describe the modules of the end-to-end framework in more detail, e.g., variable-length data loading, frontend convolutional network structure design, encoding (or pooling) layer design, loss function design, data augmentation design, transfer learning, and multitask learning. In section 4, we introduce robust methods within the end-to-end framework for far-field and noisy conditions. Finally, we connect the introduced end-to-end frameworks to other related tasks, e.g., speaker diarization, paralinguistic speech attribute recognition, and anti-spoofing countermeasures.
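
To make the correspondence between the traditional pipeline and the end-to-end framework concrete, the following is a minimal PyTorch sketch (an illustrative model, not the exact system presented in the talk): a convolutional frontend plays the role of feature extraction, a statistics pooling encoding layer maps variable-length frame-level features to a fixed-dimensional utterance-level representation, and fully connected layers act as the backend classifier. All layer sizes below are assumed for illustration.

    # Minimal sketch of an end-to-end speaker/language recognition network:
    # CNN frontend -> statistics pooling (encoding layer) -> fully connected backend.
    import torch
    import torch.nn as nn

    class EndToEndSpeakerNet(nn.Module):
        def __init__(self, num_classes, feat_dim=64):
            super().__init__()
            # Frontend: 1-D convolutions over the time axis of acoustic features
            self.frontend = nn.Sequential(
                nn.Conv1d(feat_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            )
            # Backend: fully connected layers on the fixed-length utterance embedding
            self.embedding = nn.Linear(2 * 256, 256)   # mean + std -> embedding
            self.classifier = nn.Linear(256, num_classes)

        def forward(self, x):
            # x: (batch, feat_dim, num_frames); num_frames may vary between batches
            h = self.frontend(x)
            # Encoding layer: statistics pooling turns variable-length frame-level
            # features into a single fixed-dimensional utterance-level vector
            stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
            emb = torch.relu(self.embedding(stats))
            return self.classifier(emb)    # train with, e.g., cross-entropy loss

    # Usage: utterances of different lengths yield the same embedding size
    model = EndToEndSpeakerNet(num_classes=1000)
    logits = model(torch.randn(8, 64, 300))   # 8 utterances, 300 frames each

Because the pooling step collapses the time axis, the same network handles utterances of flexible duration; loss function choice, data augmentation, and transfer or multitask learning are then applied on top of this basic structure.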


Cite as: Li, M., Cai, W., Cai, D. (2019) Survey Talk: End-to-End Deep Neural Network Based Speaker and Language Recognition. Proc. Interspeech 2019.


@inproceedings{Li2019,
  author={Ming Li and Weicheng Cai and Danwei Cai},
  title={{Survey Talk: End-to-End Deep Neural Network Based Speaker and Language Recognition}},
  year=2019,
  booktitle={Proc. Interspeech 2019}
}