SparseSpeech: Unsupervised Acoustic Unit Discovery with Memory-Augmented Sequence Autoencoders

Benjamin Milde, Chris Biemann


We propose a sparse sequence autoencoder model for unsupervised acoustic unit discovery, based on bidirectional LSTM encoders/decoders with a sparsity-inducing bottleneck. The sparsity layer is based on memory-augmented neural networks, with a differentiable embedding memory bank addressed from the encoder. The decoder reconstructs the encoded input feature sequence from an utterance-level context embedding and the bottleneck representation. At some time steps, the input to the decoder is randomly omitted by applying sequence dropout, forcing the decoder to learn about the temporal structure of the sequence. We propose a bootstrapping training procedure, after which the network can be trained end-to-end with standard back-propagation. The sparsity of the generated representation can be controlled with a parameter in the proposed loss function. We evaluate the discovered units with the ABX discriminability task on minimal triphone pairs and also on entire words. Forcing the network to favor highly sparse addressings in the memory component yields symbolic-like representations of speech that are very compact and still offer better ABX discriminability than MFCC features.
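The core mechanism described above, differentiable memory addressing with a sparsity penalty and sequence dropout, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the shapes, the dot-product addressing, the entropy-based sparsity penalty, and the name `lambda_sparse` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: T encoder time steps, D hidden dim, K memory slots.
T, D, K = 5, 8, 16
encoder_states = rng.standard_normal((T, D))   # stand-in for BiLSTM encoder outputs
memory_bank = rng.standard_normal((K, D))      # learnable embedding memory bank

# Differentiable addressing: similarity scores -> softmax weights per time step.
weights = softmax(encoder_states @ memory_bank.T)   # (T, K), rows sum to 1
bottleneck = weights @ memory_bank                  # (T, D) read vectors

# One plausible sparsity penalty: mean entropy of the addressings; a weight
# lambda_sparse (hypothetical) would trade reconstruction against sparsity.
lambda_sparse = 0.1
entropy_penalty = lambda_sparse * (
    -(weights * np.log(weights + 1e-9)).sum(axis=1).mean()
)

# Sequence dropout: randomly omit decoder inputs at some time steps, forcing
# the decoder to model the temporal structure of the sequence.
drop_mask = rng.random(T) < 0.3
decoder_inputs = np.where(drop_mask[:, None], 0.0, bottleneck)
```

Pushing the addressings toward one-hot weights (low entropy) is what yields the compact, symbolic-like units the abstract refers to.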


DOI: 10.21437/Interspeech.2019-2938

Cite as: Milde, B., Biemann, C. (2019) SparseSpeech: Unsupervised Acoustic Unit Discovery with Memory-Augmented Sequence Autoencoders. Proc. Interspeech 2019, 256-260, DOI: 10.21437/Interspeech.2019-2938.


@inproceedings{Milde2019,
  author={Benjamin Milde and Chris Biemann},
  title={{SparseSpeech: Unsupervised Acoustic Unit Discovery with Memory-Augmented Sequence Autoencoders}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={256--260},
  doi={10.21437/Interspeech.2019-2938},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2938}
}