Unsupervised Bottleneck Features for Low-Resource Query-by-Example Spoken Term Detection

Hongjie Chen, Cheung-Chi Leung, Lei Xie, Bin Ma, Haizhou Li


We propose a framework which ports Dirichlet process Gaussian mixture model (DPGMM) based labels to a deep neural network (DNN). The DNN trained with the unsupervised labels is used to extract a low-dimensional unsupervised speech representation, named unsupervised bottleneck features (uBNFs), which capture considerable information for sound cluster discrimination. We investigate the performance of uBNFs in query-by-example spoken term detection (QbE-STD) on the TIMIT English speech corpus. Our uBNFs perform comparably with cross-lingual bottleneck features (BNFs) extracted from a DNN trained on 171 hours of transcribed telephone speech in another language (Mandarin Chinese). With the score fusion of uBNFs and cross-lingual BNFs, we gain about 10% relative improvement in mean average precision (MAP) compared with the cross-lingual BNFs alone. We also study the performance of the framework with different input features and different lengths of temporal context.
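To make the described pipeline concrete, the sketch below illustrates the general idea: cluster frame-level acoustic features with a variational DPGMM to obtain unsupervised frame labels, train a DNN with a low-dimensional bottleneck layer to predict those labels, and read out the bottleneck activations as uBNFs. This is a minimal illustration under assumed tooling (scikit-learn's BayesianGaussianMixture as the DPGMM, a small PyTorch MLP, and a feature array `feats`), not the authors' exact configuration.

```python
# Minimal sketch of the uBNF pipeline: DPGMM pseudo-labels -> bottleneck DNN -> uBNFs.
# Assumes frame-level features are available as a NumPy array `feats` of shape
# (num_frames, feat_dim). Library choices are illustrative stand-ins.
import numpy as np
import torch
import torch.nn as nn
from sklearn.mixture import BayesianGaussianMixture


def dpgmm_labels(feats, max_components=100):
    """Cluster frames with a (truncated, variational) DPGMM and return
    per-frame cluster indices to serve as unsupervised training targets."""
    dpgmm = BayesianGaussianMixture(
        n_components=max_components,
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="diag",
        max_iter=200,
    )
    dpgmm.fit(feats)
    return dpgmm.predict(feats)


class BottleneckDNN(nn.Module):
    """Feed-forward DNN with a low-dimensional bottleneck layer, trained to
    predict DPGMM cluster labels and then used as a feature extractor."""

    def __init__(self, in_dim, num_classes, hidden=1024, bn_dim=40):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, bn_dim),  # bottleneck layer -> uBNFs
        )
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(bn_dim, num_classes))

    def forward(self, x):
        return self.classifier(self.encoder(x))

    def extract_ubnf(self, x):
        with torch.no_grad():
            return self.encoder(x)


def train_ubnf_extractor(feats, labels, epochs=10, lr=1e-3):
    """Train the bottleneck DNN on DPGMM labels (full-batch for brevity;
    minibatches would be used in practice)."""
    x = torch.from_numpy(feats).float()
    y = torch.from_numpy(labels).long()
    model = BottleneckDNN(in_dim=x.shape[1], num_classes=int(y.max()) + 1)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return model


# Usage: labels = dpgmm_labels(feats); model = train_ubnf_extractor(feats, labels)
# uBNFs for QbE-STD matching: model.extract_ubnf(torch.from_numpy(feats).float())
```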


DOI: 10.21437/Interspeech.2016-313

Cite as

Chen, H., Leung, C., Xie, L., Ma, B., Li, H. (2016) Unsupervised Bottleneck Features for Low-Resource Query-by-Example Spoken Term Detection. Proc. Interspeech 2016, 923-927.

Bibtex
@inproceedings{Chen+2016,
  author={Hongjie Chen and Cheung-Chi Leung and Lei Xie and Bin Ma and Haizhou Li},
  title={Unsupervised Bottleneck Features for Low-Resource Query-by-Example Spoken Term Detection},
  year=2016,
  booktitle={Interspeech 2016},
  doi={10.21437/Interspeech.2016-313},
  url={http://dx.doi.org/10.21437/Interspeech.2016-313},
  pages={923--927}
}