Discovery of statistically significant patterns in data, and learning of associative links between qualitatively different data streams, is becoming increasingly important in dealing with the so-called Big Data problem of modern society. In this work, a methodological framework is presented for automatic discovery of statistical associations between a high bit-rate, noisy sensory signal (speech) and temporally discrete categorical data with a different temporal granularity (text). The proposed approach does not utilize any phonetic or linguistic knowledge in the analysis, but simply learns the meaningful units of text and speech and their mutual mappings in an unsupervised manner. First experiments with a limited vocabulary of child-directed speech show that, after a period of learning, the method successfully generates a textual representation of continuous speech.
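The core idea of associating learned speech units with text units can be illustrated, in spirit only, by a toy co-occurrence sketch. All function names and toy data below are hypothetical and not from the paper; the actual method operates on continuous acoustic features and discovers its units in an unsupervised fashion, whereas here the speech-unit sequences are assumed to be already quantized:

```python
from collections import defaultdict

def train_associations(speech_unit_seqs, text_labels):
    """Count co-occurrences between discrete speech units and text labels."""
    counts = defaultdict(lambda: defaultdict(int))
    for units, label in zip(speech_unit_seqs, text_labels):
        for u in units:
            counts[u][label] += 1
    return counts

def transcribe(units, counts):
    """Map a speech-unit sequence to the text label with the highest
    accumulated association score (None if no unit was ever seen)."""
    scores = defaultdict(int)
    for u in units:
        for label, c in counts.get(u, {}).items():
            scores[label] += c
    return max(scores, key=scores.get) if scores else None

# Toy "utterances": lists of quantized acoustic units paired with word labels.
data = [([1, 2, 3], "ball"), ([1, 2, 4], "ball"),
        ([5, 6, 7], "dog"), ([5, 6, 8], "dog")]
counts = train_associations([u for u, _ in data], [w for _, w in data])
print(transcribe([1, 2, 9], counts))  # → ball
```

Even this crude counting scheme recovers the word from a partially novel unit sequence, which conveys why purely statistical association, without phonetic or linguistic knowledge, can suffice in a limited-vocabulary setting.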
Bibliographic reference. Knuuttila, Juho / Räsänen, Okko / Laine, Unto K. (2013): "Automatic self-supervised learning of associations between speech and text", In INTERSPEECH-2013, 465-469.