Towards Unsupervised Learning of Speech Representations

Mirco Ravanelli


The success of deep learning techniques strongly depends on the quality of the representations that are automatically discovered from data. These representations should capture intermediate concepts, features, or latent variables, and are commonly learned in a supervised way using large annotated corpora. Even though this is still the dominant paradigm, it suffers from some crucial limitations. Collecting large amounts of annotated examples, for instance, is very costly and time-consuming. Moreover, supervised representations are likely to be biased toward the considered problem, possibly limiting their transferability to other problems and applications. A natural way to mitigate these issues is unsupervised learning. Unsupervised learning attempts to extract knowledge from unlabeled data, and can potentially discover representations that capture the underlying structure of such data. This modality, sometimes referred to as self-supervised learning, is gaining popularity within the computer vision community, while its application to high-dimensional and long temporal sequences like speech remains challenging. In this keynote, I will summarize some recent efforts to learn general, robust, and transferable speech representations using unsupervised/self-supervised approaches. In particular, I will focus on a novel technique called Local Info Max (LIM), which learns speech representations using a maximum mutual information approach. I will then introduce the recently-proposed problem-agnostic speech encoder (PASE), which is derived by jointly solving multiple self-supervised tasks. PASE is a first step towards a universal neural speech encoder and has proven useful for a large variety of applications such as speech recognition, speaker identification, and emotion recognition.
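The mutual-information idea behind LIM-style training can be illustrated with a toy sketch: a discriminator is trained to assign high scores to pairs of representations drawn from the same utterance and low scores to pairs drawn from different utterances, which amounts to maximizing a lower bound on mutual information. The linear-tanh encoder, the dot-product discriminator, and all dimensions below are illustrative assumptions for this sketch, not the architecture used in the actual work.

```python
import numpy as np

rng = np.random.default_rng(0)


def encode(chunk, W):
    # Toy encoder: a linear projection followed by tanh.
    # Stands in for the real convolutional speech encoder (an assumption).
    return np.tanh(W @ chunk)


def lim_loss(z_anchor, z_pos, z_neg):
    """Binary cross-entropy objective in the spirit of LIM:
    a simple discriminator (here a dot product squashed by a sigmoid)
    should score chunks from the same utterance as 1 (positive pair)
    and chunks from different utterances as 0 (negative pair)."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    p_pos = sigmoid(np.dot(z_anchor, z_pos))   # same-utterance pair
    p_neg = sigmoid(np.dot(z_anchor, z_neg))   # different-utterance pair
    eps = 1e-9                                 # numerical safety
    return -(np.log(p_pos + eps) + np.log(1.0 - p_neg + eps))


# Toy data: two nearby chunks from one "utterance", one from another.
W = rng.standard_normal((16, 64)) * 0.1
utt = rng.standard_normal(64)
other = rng.standard_normal(64)

z_a = encode(utt, W)
z_p = encode(utt + 0.01 * rng.standard_normal(64), W)  # positive chunk
z_n = encode(other, W)

loss = float(lim_loss(z_a, z_p, z_n))
print(loss)
```

In a real training loop, the encoder and discriminator parameters would be updated by gradient descent on this loss, pulling same-utterance representations together and pushing different-utterance ones apart.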


Cite as: Ravanelli, M. (2020) Towards Unsupervised Learning of Speech Representations. Proc. Odyssey 2020 The Speaker and Language Recognition Workshop.


@inproceedings{Ravanelli2020,
  author={Mirco Ravanelli},
  title={{Towards Unsupervised Learning of Speech Representations}},
  year=2020,
  booktitle={Proc. Odyssey 2020 The Speaker and Language Recognition Workshop}
}