Jointly Learning to Locate and Classify Words Using Convolutional Networks

Dimitri Palaz, Gabriel Synnaeve, Ronan Collobert


In this paper, we propose a novel approach for weakly-supervised word recognition. Most state-of-the-art automatic speech recognition systems are based on frame-level labels obtained through forced alignments or through a sequential loss. Recently, weakly-supervised models have been proposed in vision that can learn which part of the input is relevant for classifying a given pattern [1]. Our system is composed of a convolutional neural network and a temporal score aggregation mechanism. It is trained using as supervision only some of the words (the most frequent ones) present in each sentence, without knowing their order or quantity. We show that our proposed system is able to jointly classify and localize words. We also evaluate the system on a keyword spotting task, and show that it can yield performance similar to a strong supervised HMM/GMM baseline.
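
The abstract describes a network that produces scores over time and aggregates them into sentence-level word predictions. The following is a minimal sketch of that idea, not the authors' implementation. It assumes log-sum-exp pooling as the aggregation, which the abstract does not specify, and a binary cross-entropy loss on word presence. The `scores` matrix stands in for the output of a hypothetical convolutional network, and all function names and the parameter `r` are illustrative.

```python
import math

def logsumexp_pool(frame_scores, r=1.0):
    """Aggregate one word's per-frame scores into a single sentence-level score.

    Computes (1/r) * log((1/T) * sum_t exp(r * s_t)), a smooth maximum over
    time. The choice of this pooling function is an assumption.
    """
    T = len(frame_scores)
    m = max(frame_scores)  # factor out the max for numerical stability
    total = sum(math.exp(r * (s - m)) for s in frame_scores)
    return m + math.log(total / T) / r

def sentence_scores(scores, r=1.0):
    """scores[t][k] is the (hypothetical) network output for word k at frame t."""
    num_words = len(scores[0])
    return [logsumexp_pool([frame[k] for frame in scores], r)
            for k in range(num_words)]

def bag_of_words_loss(pooled, present):
    """Binary cross-entropy on word presence: no order, no counts."""
    loss = 0.0
    for s, y in zip(pooled, present):
        # -log(sigmoid(s)) if the word is present, -log(1 - sigmoid(s)) otherwise
        z = -s if y else s
        loss += max(z, 0.0) + math.log1p(math.exp(-abs(z)))  # stable softplus(z)
    return loss

def localize(scores, k):
    """Return the frame index at which word k receives its highest score."""
    return max(range(len(scores)), key=lambda t: scores[t][k])

# Toy frame-level scores: 5 frames, 3 vocabulary words (made-up values).
scores = [
    [-2.0, -1.0, -3.0],
    [ 4.0, -1.5, -3.0],
    [-1.0, -2.0, -2.5],
    [-2.0,  3.0, -3.0],
    [-2.5, -1.0, -2.0],
]
present = [1, 1, 0]  # words 0 and 1 occur in the sentence; word 2 does not

pooled = sentence_scores(scores, r=1.0)
loss = bag_of_words_loss(pooled, present)
peaks = [localize(scores, k) for k in range(3)]
```

In this sketch the training signal touches only the pooled scores, so the per-frame scores are never supervised directly. Reading off the highest-scoring frame for a word that is present then gives a rough location for it, which is one way classification and localization can be learned jointly.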


DOI: 10.21437/Interspeech.2016-968

Cite as

Palaz, D., Synnaeve, G., Collobert, R. (2016) Jointly Learning to Locate and Classify Words Using Convolutional Networks. Proc. Interspeech 2016, 2741-2745.

BibTeX
@inproceedings{Palaz+2016,
author={Dimitri Palaz and Gabriel Synnaeve and Ronan Collobert},
title={Jointly Learning to Locate and Classify Words Using Convolutional Networks},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-968},
url={http://dx.doi.org/10.21437/Interspeech.2016-968},
pages={2741--2745}
}