LSTM Based Attentive Fusion of Spectral and Prosodic Information for Keyword Spotting in Hindi Language

Laxmi Pandey, Karan Nathwani


In this paper, a DNN-based keyword spotting framework that utilizes both the spectral and the prosodic information present in the speech signal is proposed. A DNN is first trained to learn a set of hierarchical non-linear transformation parameters that project the original spectral and prosodic feature vectors onto a feature space where the distance between similar syllable pairs is small and the distance between dissimilar syllable pairs is large. These transformed features are then fused using an attention-based long short-term memory (LSTM) network. In addition, a fine-tuning technique based on a deep denoising autoencoder is used to improve the performance of sequence predictions. A sequence matching method called the sliding syllable protocol is also developed for keyword spotting. Syllable recognition and keyword spotting (KWS) experiments are conducted specifically for the Hindi language, which is one of the most widely spoken languages in the world but has received little attention from the speech processing community. The proposed framework shows reasonable improvements over baseline methods available in the literature.
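
The abstract names a sliding syllable protocol for sequence matching but does not describe its details. The following is only a minimal sketch of one plausible reading, assuming the recognizer emits a sequence of syllable labels and the keyword is a shorter syllable sequence that is slid across it one position at a time. The function name `sliding_syllable_match`, the `min_score` threshold, and the romanized example syllables are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def sliding_syllable_match(
    predicted: Sequence[str],
    keyword: Sequence[str],
    min_score: float = 1.0,
) -> List[int]:
    """Return start indices where `keyword` matches a window of `predicted`.

    Assumption (not from the paper): a window counts as a hit when the
    fraction of position-wise equal syllables is at least `min_score`.
    """
    n, k = len(predicted), len(keyword)
    if k == 0 or k > n:
        return []
    hits = []
    # Slide a window of keyword length over the predicted syllable stream.
    for start in range(n - k + 1):
        window = predicted[start:start + k]
        matched = sum(p == q for p, q in zip(window, keyword))
        if matched / k >= min_score:
            hits.append(start)
    return hits


# Hypothetical romanized syllable stream and keyword, for illustration only.
predicted = ["aa", "j", "na", "mas", "te", "du", "ni", "ya"]
keyword = ["na", "mas", "te"]
print(sliding_syllable_match(predicted, keyword))  # [2]
```

Lowering `min_score` in this sketch tolerates some syllable recognition errors at the cost of more false alarms. How the paper's own protocol scores and thresholds matches is not stated in the abstract.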


DOI: 10.21437/Interspeech.2018-1016

Cite as: Pandey, L., Nathwani, K. (2018) LSTM Based Attentive Fusion of Spectral and Prosodic Information for Keyword Spotting in Hindi Language. Proc. Interspeech 2018, 112-116, DOI: 10.21437/Interspeech.2018-1016.


@inproceedings{Pandey2018,
  author={Laxmi Pandey and Karan Nathwani},
  title={LSTM Based Attentive Fusion of Spectral and Prosodic Information for Keyword Spotting in Hindi Language},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={112--116},
  doi={10.21437/Interspeech.2018-1016},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1016}
}