BERTphone: Phonetically-aware Encoder Representations for Utterance-level Speaker and Language Recognition

Shaoshi Ling, Julian Salazar, Yuzong Liu, Katrin Kirchhoff


We introduce BERTphone, a Transformer encoder trained on large speech corpora that outputs phonetically-aware contextual representation vectors that can be used for both speaker and language recognition. This is accomplished by training on two objectives: the first, inspired by adapting BERT to the continuous domain, involves masking spans of input frames and reconstructing the whole sequence for acoustic representation learning; the second, inspired by the success of bottleneck features from ASR, is a sequence-level CTC loss applied to phoneme labels for phonetic representation learning. We pretrain two BERTphone models (one on Fisher and one on TED-LIUM) and use them as feature extractors for x-vector-style DNNs on both tasks. We attain a state-of-the-art C_avg of 6.16 on the challenging LRE07 3sec closed-set language recognition task. On Fisher and VoxCeleb speaker recognition tasks, we see an 18% relative reduction in speaker EER when training on BERTphone vectors instead of MFCCs. In general, BERTphone outperforms previous phonetic pretraining approaches on the same data.
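The two objectives described above (masked-span frame reconstruction plus a sequence-level CTC loss on phoneme labels) can be sketched as follows. This is a minimal, hedged illustration in PyTorch, not the paper's implementation: the encoder size, masking scheme, heads, and the interpolation weight `lam` are all illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative dimensions: frames, batch, feature dim, phoneme inventory (incl. CTC blank).
T, B, D, n_phones = 50, 2, 40, 45
frames = torch.randn(T, B, D)  # input acoustic frames (e.g. filterbanks/MFCCs)

# 1) Mask a contiguous span of input frames (BERT-style masking in the continuous domain).
masked = frames.clone()
span_start, span_len = 10, 8  # hypothetical span; the paper masks random spans
masked[span_start:span_start + span_len] = 0.0

# A small Transformer encoder stands in for the BERTphone encoder.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, dim_feedforward=128),
    num_layers=2)
hidden = encoder(masked)  # (T, B, D) contextual representations

# 2a) Acoustic objective: reconstruct the whole original frame sequence.
recon_head = nn.Linear(D, D)
recon_loss = nn.functional.l1_loss(recon_head(hidden), frames)

# 2b) Phonetic objective: sequence-level CTC loss against phoneme labels.
phone_head = nn.Linear(D, n_phones)
log_probs = phone_head(hidden).log_softmax(-1)        # (T, B, n_phones)
targets = torch.randint(1, n_phones, (B, 20))          # dummy phoneme label sequences
input_lens = torch.full((B,), T, dtype=torch.long)
target_lens = torch.full((B,), 20, dtype=torch.long)
ctc_loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lens, target_lens)

# Combine the two objectives; lam is a hypothetical interpolation weight.
lam = 0.5
total_loss = lam * ctc_loss + (1 - lam) * recon_loss
```

In training, `total_loss` would be backpropagated through the encoder so the resulting representations carry both acoustic and phonetic information; at extraction time only the encoder outputs (`hidden`) would be fed to the downstream speaker/language recognition DNN.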


 DOI: 10.21437/Odyssey.2020-2

Cite as: Ling, S., Salazar, J., Liu, Y., Kirchhoff, K. (2020) BERTphone: Phonetically-aware Encoder Representations for Utterance-level Speaker and Language Recognition. Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 9-16, DOI: 10.21437/Odyssey.2020-2.


@inproceedings{Ling2020,
  author={Shaoshi Ling and Julian Salazar and Yuzong Liu and Katrin Kirchhoff},
  title={{BERTphone: Phonetically-aware Encoder Representations for Utterance-level Speaker and Language Recognition}},
  year=2020,
  booktitle={Proc. Odyssey 2020 The Speaker and Language Recognition Workshop},
  pages={9--16},
  doi={10.21437/Odyssey.2020-2},
  url={http://dx.doi.org/10.21437/Odyssey.2020-2}
}