Siamese Recurrent Auto-Encoder Representation for Query-by-Example Spoken Term Detection

Ziwei Zhu, Zhiyong Wu, Runnan Li, Helen Meng, Lianhong Cai


With the explosive development of human-computer speech interaction, spoken term detection (STD) is widely required and has attracted increasing interest. In this paper, we propose a weakly supervised approach that uses a Siamese recurrent auto-encoder (RAE) to represent speech segments for query-by-example spoken term detection (QbyE-STD). The proposed approach trains the Siamese RAE on word pairs, where each pair contains two instances of either the same word or different words. The last hidden state vector of the encoder is used as the feature for QbyE-STD; it is a fixed-dimensional embedding that mostly carries information related to semantic content. The advantages of the proposed approach are: 1) it extracts a more compact, fixed-dimensional feature while keeping the semantic information needed for STD; 2) the extracted feature can describe, to some degree, the sequential phonetic structure of similar sounds, so it can be applied to zero-resource QbyE-STD. Evaluations on real-scene Chinese speech interaction data and TIMIT confirm the effectiveness and efficiency of the proposed approach compared with conventional ones.
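
The core idea of the abstract, mapping a variable-length speech segment to a fixed-dimensional embedding and comparing a pair of such embeddings, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the plain tanh recurrent cell, the untrained random weights, the cosine distance, the margin-based contrastive loss, and all names (`make_encoder`, `encode`, `cosine_distance`, `contrastive_loss`). The reconstruction (decoder) part of the auto-encoder is omitted entirely.

```python
import math
import random


def make_encoder(input_dim, hidden_dim, seed=0):
    """Build a toy recurrent encoder with random, untrained weights (hypothetical)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {"W_in": matrix(hidden_dim, input_dim),
            "W_rec": matrix(hidden_dim, hidden_dim)}


def encode(encoder, frames):
    """Run the recurrent cell over all frames and return the last hidden state.

    The output length equals hidden_dim regardless of how many frames are given.
    """
    hidden_dim = len(encoder["W_in"])
    h = [0.0] * hidden_dim
    for x in frames:
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(encoder["W_in"][j], x))
                + sum(w * hi for w, hi in zip(encoder["W_rec"][j], h))
            )
            for j in range(hidden_dim)
        ]
    return h


def cosine_distance(a, b):
    """Return 1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def contrastive_loss(emb_a, emb_b, same_word, margin=0.5):
    """Assumed pair loss: pull same-word pairs together, push others past a margin."""
    d = cosine_distance(emb_a, emb_b)
    if same_word:
        return d * d
    return max(0.0, margin - d) ** 2


# Both branches of the Siamese pair share one encoder (shared weights).
encoder = make_encoder(input_dim=3, hidden_dim=4)
segment_a = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]                    # 2 frames
segment_b = [[0.2, 0.1, 0.3], [0.1, -0.2, 0.5], [0.0, 0.0, 0.1]]   # 3 frames
emb_a = encode(encoder, segment_a)
emb_b = encode(encoder, segment_b)
pair_loss = contrastive_loss(emb_a, emb_b, same_word=True)
```

At detection time, a sketch like this would compare the query embedding with the embeddings of candidate segments and rank the candidates by distance; how the paper segments the search audio and which distance it uses are not stated in the abstract.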


DOI: 10.21437/Interspeech.2018-1788

Cite as: Zhu, Z., Wu, Z., Li, R., Meng, H., Cai, L. (2018) Siamese Recurrent Auto-Encoder Representation for Query-by-Example Spoken Term Detection. Proc. Interspeech 2018, 102-106, DOI: 10.21437/Interspeech.2018-1788.


@inproceedings{Zhu2018,
  author={Ziwei Zhu and Zhiyong Wu and Runnan Li and Helen Meng and Lianhong Cai},
  title={Siamese Recurrent Auto-Encoder Representation for Query-by-Example Spoken Term Detection},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={102--106},
  doi={10.21437/Interspeech.2018-1788},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1788}
}