Speaker Characterization Using TDNN, TDNN-LSTM, TDNN-LSTM-Attention based Speaker Embeddings for NIST SRE 2019

Chien-Lin Huang


In this paper, we explore speaker characterization using speaker embeddings based on the time-delay neural network (TDNN), the long short-term memory (LSTM) neural network, and attention (TDNN-LSTM-Attention). TDNN, TDNN-LSTM, and TDNN-LSTM-Attention speaker embeddings are investigated on large-scale training and test datasets. Different types of front-end feature extraction are compared to find good features for speaker embedding. To increase the amount and diversity of the training data, 4 kinds of data augmentation are used to create 7 new copies of the original data. The proposed methods are evaluated on the National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) tasks. Experimental results show that the proposed methods achieve minimum decision cost function values of 0.372 and 0.392 on the NIST SRE 2018 and SRE 2019 evaluation datasets, respectively.
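For readers unfamiliar with the reported metric, the following is a minimal sketch of how a normalized minimum decision cost function can be computed from verification scores. It is not the paper's scoring code or the official NIST scoring tool. The function name `min_dcf` and the default values of `p_target`, `c_miss`, and `c_fa` are illustrative assumptions and are not taken from the paper or from the evaluation plan.

```python
def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Normalized minimum decision cost over all thresholds.

    p_target, c_miss and c_fa are placeholder values (assumptions), not the
    operating points of any particular evaluation.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")

    # Pool all trials and sort by score, lowest first.
    trials = sorted(
        [(s, True) for s in target_scores] + [(s, False) for s in nontarget_scores],
        key=lambda t: t[0],
    )
    n_tar, n_non = len(target_scores), len(nontarget_scores)

    def cost(misses, false_alarms):
        # Weighted sum of miss rate and false-alarm rate.
        return (c_miss * p_target * misses / n_tar
                + c_fa * (1.0 - p_target) * false_alarms / n_non)

    # Threshold below every score: all trials accepted.
    misses, false_alarms = 0, n_non
    best = cost(misses, false_alarms)

    for i, (score, is_target) in enumerate(trials):
        # Moving the threshold just above this score rejects the trial.
        if is_target:
            misses += 1
        else:
            false_alarms -= 1
        # Only evaluate between distinct score values so ties move together.
        if i + 1 == len(trials) or trials[i + 1][0] != score:
            best = min(best, cost(misses, false_alarms))

    # Normalize by the cost of the better trivial system
    # (always reject or always accept).
    return best / min(c_miss * p_target, c_fa * (1.0 - p_target))


if __name__ == "__main__":
    print(min_dcf([2.0, 3.0], [0.0, 1.0]))
```

With this normalization, a value of 0 means the target and non-target scores are perfectly separable, and a value of 1 means the system does no better than a trivial accept-all or reject-all decision at the assumed operating point.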


DOI: 10.21437/Odyssey.2020-60

Cite as: Huang, C. (2020) Speaker Characterization Using TDNN, TDNN-LSTM, TDNN-LSTM-Attention based Speaker Embeddings for NIST SRE 2019. Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 423-427, DOI: 10.21437/Odyssey.2020-60.


@inproceedings{Huang2020,
  author={Chien-Lin Huang},
  title={{Speaker Characterization Using TDNN, TDNN-LSTM, TDNN-LSTM-Attention based Speaker Embeddings for NIST SRE 2019}},
  year=2020,
  booktitle={Proc. Odyssey 2020 The Speaker and Language Recognition Workshop},
  pages={423--427},
  doi={10.21437/Odyssey.2020-60},
  url={http://dx.doi.org/10.21437/Odyssey.2020-60}
}