Spoken Keyword Detection Using Joint DTW-CNN

Ravi Shankar, Vikram C M, S R Mahadeva Prasanna


A method to detect spoken keywords in a given speech utterance is proposed, called joint Dynamic Time Warping (DTW)-Convolutional Neural Network (CNN). It combines the DTW approach with a strong classifier, a CNN. These two methods have independently shown significant results in optimal sequence alignment and object recognition, respectively. The proposed method modifies the original DTW formulation and converts the warping matrix into a grayscale image. A CNN is trained on these images to classify the presence or absence of a keyword by identifying the texture of the warping matrix. Experiments were conducted on the TIMIT corpus, and the method shows significant improvement over existing techniques.
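To make the idea of turning a warping matrix into a grayscale image concrete, the following is a minimal sketch only. The abstract does not state how the DTW formulation is modified, so the sketch assumes the textbook DTW accumulated-cost recursion with Euclidean frame distances and min-max scaling to 8-bit intensities. The function names (frame_distances, accumulated_cost, to_grayscale) and the random feature arrays are hypothetical and are not taken from the paper.

```python
import numpy as np


def frame_distances(keyword, utterance):
    """Pairwise Euclidean distances between keyword and utterance frames.

    keyword: array of shape (n_frames_k, n_features)
    utterance: array of shape (n_frames_u, n_features)
    """
    diff = keyword[:, None, :] - utterance[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def accumulated_cost(dist):
    """Textbook DTW accumulated cost (assumption: NOT the paper's modified form)."""
    n, m = dist.shape
    acc = np.full((n, m), np.inf)
    acc[0, 0] = dist[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            # Cheapest predecessor among vertical, horizontal and diagonal steps.
            best = min(
                acc[i - 1, j] if i > 0 else np.inf,
                acc[i, j - 1] if j > 0 else np.inf,
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
            )
            acc[i, j] = dist[i, j] + best
    return acc


def to_grayscale(matrix):
    """Min-max scale a real-valued matrix to uint8 intensities in [0, 255]."""
    lo, hi = matrix.min(), matrix.max()
    if hi == lo:
        return np.zeros(matrix.shape, dtype=np.uint8)
    scaled = (matrix - lo) / (hi - lo)
    return np.round(255 * scaled).astype(np.uint8)


# Hypothetical stand-ins for acoustic features (e.g. 13-dimensional frames).
rng = np.random.default_rng(0)
keyword_feats = rng.normal(size=(20, 13))
utterance_feats = rng.normal(size=(120, 13))

image = to_grayscale(accumulated_cost(frame_distances(keyword_feats, utterance_feats)))
print(image.shape, image.dtype)
```

An image produced this way has one row per keyword frame and one column per utterance frame, so it could be resized or padded to a fixed size before being passed to an image classifier such as a CNN. The resizing step and the classifier itself are left out here because the abstract gives no details about them.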


DOI: 10.21437/Interspeech.2018-1436

Cite as: Shankar, R., C M, V., Prasanna, S.R.M. (2018) Spoken Keyword Detection Using Joint DTW-CNN. Proc. Interspeech 2018, 117-121, DOI: 10.21437/Interspeech.2018-1436.


@inproceedings{Shankar2018,
  author={Ravi Shankar and Vikram {C M} and S R Mahadeva Prasanna},
  title={Spoken Keyword Detection Using Joint DTW-CNN},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={117--121},
  doi={10.21437/Interspeech.2018-1436},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1436}
}