Temporally-Aware Acoustic Unit Discovery for Zerospeech 2019 Challenge

Bolaji Yusuf, Alican Gök, Batuhan Gundogdu, Oyku Deniz Kose, Murat Saraclar

Zero-resource speech processing efforts focus on unsupervised discovery of sub-word acoustic units. Common approaches work with spatial similarities between acoustic frame representations within Bayesian or neural network-based frameworks. We propose two methods that use temporal proximity information, in addition to acoustic similarity, to cluster frames into acoustic units. The first approach uses a temporally biased self-organizing map (SOM) to discover such units. Since the SOM unit indices are correlated with (vector) spatial distance, we pool neighboring units and then train a recurrent neural network to predict each pooled unit. The second approach incorporates temporal awareness by training a recurrent sparse autoencoder, in which unsupervised clustering is done on the intermediate softmax layer. This network is then fine-tuned using aligned pairs of acoustically similar sequences obtained via unsupervised term discovery. Our approaches outperform the provided baseline system on the two main metrics of the Zerospeech 2019 challenge, ABX-discriminability and bitrate of the quantized embeddings, for both English and the surprise language. Furthermore, the temporal awareness and the post-filtering techniques adopted in this work improve the continuity of the decoded unit sequences, yielding low bitrates.
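The idea of combining acoustic similarity with temporal proximity can be illustrated with a small sketch. The code below is not the temporally biased SOM or the recurrent sparse autoencoder described in the abstract. It is a hypothetical, much simpler stand-in: a greedy nearest-centre quantizer in which every unit other than the one chosen for the previous frame pays an extra cost. The function names, the `switch_penalty` parameter, the toy codebook, and the toy frames are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_units(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
    switch_penalty: float = 0.0,
) -> List[int]:
    """Greedily label each frame with a codebook index.

    Hypothetical temporal bias: any unit that differs from the unit
    chosen for the previous frame has `switch_penalty` added to its
    acoustic cost, so the labelling prefers to stay on the same unit.
    With switch_penalty=0.0 this is plain nearest-centre quantization.
    """
    labels: List[int] = []
    prev: Optional[int] = None
    for frame in frames:
        costs = []
        for k, centre in enumerate(codebook):
            cost = sq_dist(frame, centre)
            if prev is not None and k != prev:
                cost += switch_penalty
            costs.append(cost)
        prev = min(range(len(costs)), key=costs.__getitem__)
        labels.append(prev)
    return labels


def count_switches(labels: Sequence[int]) -> int:
    """Number of positions where the unit label changes."""
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)


# Toy one-dimensional data (made up for this example).
codebook = [[0.0], [1.0]]
frames = [[0.1], [0.2], [0.6], [0.3], [0.9], [1.0]]

plain = assign_units(frames, codebook)
biased = assign_units(frames, codebook, switch_penalty=0.3)

print(plain, count_switches(plain))
print(biased, count_switches(biased))
```

On this toy input the unbiased labelling flips to the second unit for the single outlying frame at 0.6 and back again, while the biased labelling stays on the first unit until the frames clearly move toward the second centre. Fewer label changes mean longer runs of identical units, which is the kind of decoding continuity the abstract associates with lower bitrates.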

 DOI: 10.21437/Interspeech.2019-1430

Cite as: Yusuf, B., Gök, A., Gundogdu, B., Kose, O.D., Saraclar, M. (2019) Temporally-Aware Acoustic Unit Discovery for Zerospeech 2019 Challenge. Proc. Interspeech 2019, 1098-1102, DOI: 10.21437/Interspeech.2019-1430.

  author={Bolaji Yusuf and Alican Gök and Batuhan Gundogdu and Oyku Deniz Kose and Murat Saraclar},
  title={{Temporally-Aware Acoustic Unit Discovery for Zerospeech 2019 Challenge}},
  booktitle={Proc. Interspeech 2019},