INTERSPEECH 2013
14th Annual Conference of the International Speech Communication Association

Lyon, France
August 25-29, 2013

Automatic Self-Supervised Learning of Associations Between Speech and Text

Juho Knuuttila, Okko Räsänen, Unto K. Laine

Aalto University, Finland

Discovering statistically significant patterns in data and learning associative links between qualitatively different data streams are becoming increasingly important in dealing with the so-called Big Data problem of modern society. This work presents a methodological framework for automatically discovering statistical associations between a high-bit-rate, noisy sensory signal (speech) and temporally discrete categorical data with a different temporal granularity (text). The proposed approach uses no phonetic or linguistic knowledge in the analysis; it simply learns the meaningful units of text and speech and their mutual mappings in an unsupervised manner. First experiments with a limited vocabulary of child-directed speech show that, after a period of learning, the method successfully generates a textual representation of continuous speech.

Bibliographic reference. Knuuttila, Juho / Räsänen, Okko / Laine, Unto K. (2013): "Automatic self-supervised learning of associations between speech and text", In INTERSPEECH-2013, 465-469.