10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

Discovering Keywords from Cross-Modal Input: Ecological vs. Engineering Methods for Enhancing Acoustic Repetitions

Guillaume Aimetti (1), Roger K. Moore (1), L. ten Bosch (2), Okko Johannes Räsänen (3), Unto Kalervo Laine (3)

(1) University of Sheffield, UK
(2) Radboud Universiteit Nijmegen, The Netherlands
(3) Helsinki University of Technology, Finland

This paper introduces a computational model that automatically segments acoustic speech data and builds internal representations of keyword classes from cross-modal (acoustic and pseudo-visual) input. Acoustic segmentation is achieved using a novel dynamic time warping technique; the paper focuses on recent investigations into enhancing the identification of repeating portions of speech. This ongoing research is inspired by current cognitive views of early language acquisition and therefore strives for ecological plausibility, with the aim of building more robust speech recognition systems. Results show that an ad hoc, computationally engineered solution can aid the discovery of repeating acoustic patterns. However, we also show that this improvement can be simulated in a more ecologically valid way.
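For readers unfamiliar with dynamic time warping (DTW), the following is a minimal sketch of the classic textbook algorithm, not the novel variant the abstract refers to, whose details are not given here. It assumes each utterance is a sequence of equal-length feature vectors and uses Euclidean distance as the local cost; the function name `dtw_distance` and the toy inputs are illustrative assumptions only.

```python
import math


def dtw_distance(a, b):
    """Classic DTW cost between two sequences of feature vectors.

    Assumption: `a` and `b` are non-empty lists of equal-dimension
    float vectors; the local cost is the Euclidean distance.
    """
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three allowed predecessor steps.
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


# Toy example: the second sequence is a time-stretched copy of the first.
short = [[0.0], [1.0], [2.0]]
stretched = [[0.0], [1.0], [1.0], [2.0]]
print(dtw_distance(short, stretched))
```

A low cumulative cost indicates that two stretches of speech follow a similar trajectory despite differences in timing, which is the general intuition behind using DTW to look for repeated acoustic patterns.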

Bibliographic reference. Aimetti, Guillaume / Moore, Roger K. / Bosch, L. ten / Räsänen, Okko Johannes / Laine, Unto Kalervo (2009): "Discovering keywords from cross-modal input: ecological vs. engineering methods for enhancing acoustic repetitions", in INTERSPEECH-2009, 1171-1174.