Unsupervised Acoustic Segmentation and Clustering Using Siamese Network Embeddings

Saurabhchand Bhati, Shekhar Nayak, K. Sri Rama Murty, Najim Dehak


Unsupervised discovery of acoustic units from the raw speech signal forms the core objective of zero-resource speech processing. It involves identifying the acoustic segment boundaries and consistently assigning unique labels to acoustically similar segments. In this work, the possible candidates for segment boundaries are identified in an unsupervised manner from the kernel Gram matrix computed from the Mel-frequency cepstral coefficients (MFCC). These segment boundary candidates are used to train a siamese network that is intended to learn embeddings that minimize intrasegment distances and maximize intersegment distances. The siamese embeddings capture phonetic information from longer contexts of the speech signal and enhance the intersegment discriminability. These properties make the siamese embeddings better suited for acoustic segmentation and clustering than the raw MFCC features. The Gram matrix computed from the siamese embeddings provides unambiguous evidence for boundary locations. The initial candidate boundaries are refined using this evidence, and siamese embeddings are extracted for the new acoustic segments. A graph-growing approach is used to cluster the siamese embeddings, and a unique label is assigned to acoustically similar segments. The performance of the proposed method for acoustic segmentation and clustering is evaluated on the Zero Resource 2017 database.
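To make the first step more concrete, the following is a minimal sketch of how boundary candidates might be read off a kernel Gram matrix of frame-level features. It is not the authors' implementation: the abstract does not specify the kernel or the boundary criterion, so the Gaussian kernel, the cross-similarity score, and the names `gram_matrix`, `boundary_candidates`, `sigma`, `width`, and `threshold` are all assumptions made for illustration. The input is assumed to be an array of shape (num_frames, dim), for example MFCC vectors computed elsewhere.

```python
import numpy as np


def gram_matrix(features, sigma=1.0):
    """Gaussian-kernel Gram matrix for a (num_frames, dim) feature array.

    The Gaussian kernel and its bandwidth `sigma` are illustrative
    assumptions; the abstract only says "kernel Gram matrix".
    """
    sq_norms = np.sum(features ** 2, axis=1)
    # Pairwise squared Euclidean distances via broadcasting.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against small negative round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def boundary_candidates(gram, width=3, threshold=0.5):
    """Frames where similarity across the frame is a low local minimum.

    For each frame t, the score is the mean kernel similarity between the
    `width` frames before t and the `width` frames starting at t. A low
    score suggests that the two sides belong to different segments.
    This criterion is a hypothetical stand-in for the paper's method.
    """
    n = gram.shape[0]
    score = np.ones(n)  # frames too close to the edges keep a neutral score
    for t in range(width, n - width + 1):
        score[t] = gram[t - width:t, t:t + width].mean()
    return [
        t
        for t in range(1, n - 1)
        if score[t] < threshold
        and score[t] < score[t - 1]
        and score[t] <= score[t + 1]
    ]
```

On a toy input of ten identical frames followed by ten frames from a clearly different cluster, `boundary_candidates(gram_matrix(features))` returns a single candidate at the frame where the second cluster begins. On real speech, the bandwidth, window width, and threshold would all need tuning.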


DOI: 10.21437/Interspeech.2019-2981

Cite as: Bhati, S., Nayak, S., Murty, K.S.R., Dehak, N. (2019) Unsupervised Acoustic Segmentation and Clustering Using Siamese Network Embeddings. Proc. Interspeech 2019, 2668-2672, DOI: 10.21437/Interspeech.2019-2981.


@inproceedings{Bhati2019,
  author={Saurabhchand Bhati and Shekhar Nayak and K. Sri Rama Murty and Najim Dehak},
  title={{Unsupervised Acoustic Segmentation and Clustering Using Siamese Network Embeddings}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={2668--2672},
  doi={10.21437/Interspeech.2019-2981},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2981}
}