ISCA Archive Interspeech 2021

Towards Unsupervised Phone and Word Segmentation Using Self-Supervised Vector-Quantized Neural Networks

Herman Kamper, Benjamin van Niekerk

We investigate segmenting and clustering speech into low-bitrate phone-like sequences without supervision. We specifically constrain pretrained self-supervised vector-quantized (VQ) neural networks so that blocks of contiguous feature vectors are assigned to the same code, thereby giving a variable-rate segmentation of the speech into discrete units. Two segmentation methods are considered. In the first, features are greedily merged until a prespecified number of segments are reached. The second uses dynamic programming to optimize a squared error with a penalty term to encourage fewer but longer segments. We show that these VQ segmentation methods can be used without alteration across a wide range of tasks: unsupervised phone segmentation, ABX phone discrimination, same-different word discrimination, and as inputs to a symbolic word segmentation algorithm. The penalized dynamic programming method generally performs best. While performance on individual tasks is only comparable to the state-of-the-art in some cases, in all tasks a reasonable competing approach is outperformed at a substantially lower bitrate.
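
As a rough illustration of the second method, the sketch below segments a feature sequence by dynamic programming, minimizing the summed squared error between each frame and the code assigned to its segment, plus a penalty per segment. It is not the authors' implementation. The function name `penalized_segment`, the constant per-segment penalty, and the toy one-dimensional features and codebook are all assumptions made for illustration; the abstract only states that a penalty term encourages fewer but longer segments and does not give its exact form.

```python
import numpy as np


def penalized_segment(features, codebook, penalty):
    """Hypothetical sketch: split `features` (T x D) into contiguous segments,
    each assigned one row of `codebook` (K x D), minimizing squared error
    plus an assumed constant `penalty` per segment."""
    T = features.shape[0]
    K = codebook.shape[0]

    # dists[t, k]: squared Euclidean distance between frame t and code k.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)

    # cum[t, k]: summed distance of frames 0..t-1 to code k, so the cost of
    # assigning frames s..t-1 to code k is cum[t, k] - cum[s, k].
    cum = np.vstack([np.zeros((1, K)), np.cumsum(dists, axis=0)])

    best = np.full(T + 1, np.inf)  # best[t]: minimum cost of frames 0..t-1
    best[0] = 0.0
    back = np.zeros(T + 1, dtype=int)  # start index of the last segment
    code = np.zeros(T + 1, dtype=int)  # code assigned to the last segment

    for t in range(1, T + 1):
        for s in range(t):
            seg_costs = cum[t] - cum[s]
            k = int(np.argmin(seg_costs))
            total = best[s] + seg_costs[k] + penalty
            if total < best[t]:
                best[t], back[t], code[t] = total, s, k

    # Backtrack from the final frame to recover (start, end, code) triples,
    # where `end` is exclusive.
    segments = []
    t = T
    while t > 0:
        s = int(back[t])
        segments.append((s, t, int(code[t])))
        t = s
    segments.reverse()
    return segments


# Toy example (assumed data): two codes and five one-dimensional frames.
toy_codebook = np.array([[0.0], [1.0]])
toy_features = np.array([[0.0], [0.1], [0.9], [1.0], [1.1]])
toy_segments = penalized_segment(toy_features, toy_codebook, penalty=0.1)
```

With a small penalty the toy sequence is split where the nearest code changes; raising the penalty merges frames into fewer, longer segments, which lowers the number of units emitted per second and hence the bitrate. The nested loop is quadratic in the number of frames; capping the maximum segment length would be one way to bound the cost.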


doi: 10.21437/Interspeech.2021-50

Cite as: Kamper, H., Niekerk, B.v. (2021) Towards Unsupervised Phone and Word Segmentation Using Self-Supervised Vector-Quantized Neural Networks. Proc. Interspeech 2021, 1539-1543, doi: 10.21437/Interspeech.2021-50

@inproceedings{kamper21_interspeech,
  author={Herman Kamper and Benjamin van Niekerk},
  title={{Towards Unsupervised Phone and Word Segmentation Using Self-Supervised Vector-Quantized Neural Networks}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={1539--1543},
  doi={10.21437/Interspeech.2021-50}
}