CNN Based Query by Example Spoken Term Detection

Dhananjay Ram, Lesly Miculicich, Hervé Bourlard


In this work, we address the problem of query-by-example spoken term detection (QbE-STD) in a zero-resource scenario. State-of-the-art solutions usually rely on dynamic time warping (DTW) based template matching. In contrast, we propose to tackle the problem as binary classification of images. As in the DTW approach, we rely on deep neural network (DNN) based posterior probabilities as feature vectors. The posteriors from a spoken query and a test utterance are used to compute frame-level similarities in a matrix form. This matrix contains a quasi-diagonal pattern if the query occurs in the test utterance. We propose to treat this matrix as an image and train a convolutional neural network (CNN) to identify the pattern and decide whether the query occurs. This language-independent system is evaluated on SWS 2013 and is shown to give a 10% relative improvement over a highly competitive DTW-based baseline system. Experiments on the QUESST 2014 database give similar improvements, showing that the approach generalizes to other databases as well.
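To make the similarity-matrix construction concrete, here is a minimal NumPy sketch. The abstract only states that frame-level similarities between the query and test posteriors are computed in matrix form; the choice of the log dot product of posterior vectors as the similarity score, and all function and variable names, are assumptions for illustration.

```python
import numpy as np

def similarity_matrix(query_post, test_post, eps=1e-10):
    """Frame-level similarity between a query and a test utterance.

    query_post: (m, K) DNN posterior vectors for the m query frames.
    test_post:  (n, K) posterior vectors for the n test frames.
    Returns an (m, n) matrix; the log dot product of posterior
    vectors is used as the similarity score (an assumption -- the
    paper only states that frame-level similarities are computed).
    """
    return np.log(np.maximum(query_post @ test_post.T, eps))

# Toy example: a 3-frame query embedded inside a 6-frame utterance.
rng = np.random.default_rng(0)

def rand_post(n, k=5):
    # Random vectors normalized so each row sums to 1, like posteriors.
    p = rng.random((n, k))
    return p / p.sum(axis=1, keepdims=True)

query = rand_post(3)
test = np.vstack([rand_post(2), query, rand_post(1)])  # query at frames 2-4
S = similarity_matrix(query, test)
# If the query occurs in the utterance, the matching region forms a
# high-similarity quasi-diagonal band in S; in the paper, S is then
# treated as a single-channel image and classified by a CNN.
```

The CNN itself is a standard binary image classifier over such matrices, so any framework's 2D-convolution stack can play that role; the sketch above only covers the input representation.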


 DOI: 10.21437/Interspeech.2018-1722

Cite as: Ram, D., Miculicich, L., Bourlard, H. (2018) CNN Based Query by Example Spoken Term Detection. Proc. Interspeech 2018, 92-96, DOI: 10.21437/Interspeech.2018-1722.


@inproceedings{Ram2018,
  author={Dhananjay Ram and Lesly Miculicich and Hervé Bourlard},
  title={CNN Based Query by Example Spoken Term Detection},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={92--96},
  doi={10.21437/Interspeech.2018-1722},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1722}
}