This paper introduces a computational model that automatically segments acoustic speech data and builds internal representations of keyword classes from cross-modal (acoustic and pseudo-visual) input. Acoustic segmentation is achieved with a novel dynamic time warping (DTW) technique; this paper focuses on recent investigations into enhancing the identification of repeating portions of speech. The ongoing research is inspired by current cognitive views of early language acquisition and therefore strives for ecological plausibility, with the aim of building more robust speech recognition systems. Results show that an ad hoc, computationally engineered solution can aid the discovery of repeating acoustic patterns; however, we show that this improvement can also be achieved in a more ecologically valid way.
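For orientation, the sketch below shows the standard textbook DTW distance between two one-dimensional sequences; it is an illustrative baseline only and does not reproduce the paper's novel segmentation variant, whose details are not given in this abstract.

```python
def dtw_distance(a, b):
    """Return the DTW alignment cost between 1-D sequences a and b.

    Standard dynamic-programming formulation: cost[i][j] is the
    minimal cost of aligning a[:i] with b[:j].
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# Warping absorbs the repeated element, so the alignment cost is zero.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # → 0.0
```

In speech applications the scalar local distance would be replaced by a frame-level distance between acoustic feature vectors (e.g. MFCCs), but the dynamic-programming recursion is the same.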
Bibliographic reference: Aimetti, Guillaume / Moore, Roger K. / Bosch, L. ten / Räsänen, Okko Johannes / Laine, Unto Kalervo (2009): "Discovering keywords from cross-modal input: ecological vs. engineering methods for enhancing acoustic repetitions", in Proceedings of INTERSPEECH 2009, pp. 1171-1174.