The objective of this work is to explore a novel unsupervised framework, using Restricted Boltzmann machines, for Spoken Word Retrieval (SWR). In the absence of labelled speech data, SWR is typically performed by matching sequence of feature vectors of query and test utterances using dynamic time warping (DTW). In such a scenario, performance of SWR system critically depends on representation of the speech signal. Typical features, like mel-frequency cepstral coefficients (MFCC), carry significant speaker-specific information, and hence may not be used directly in SWR system. To overcome this issue, we propose to capture the joint density of the acoustic space spanned by MFCCs using Gaussian-Bernoulli restricted Boltzmann machine (GBRBM). In this work, we have used hidden activations of the GBRBM as features for SWR system. Since the GBRBM is trained with speech data collected from large number of speakers, the hidden activations are more robust to the speaker-variability compared to MFCCs. The performance of the proposed features is evaluated on Telugu broadcast news data, and an absolute improvement of 12% was observed compared to MFCCs.
Bibliographic reference. Pappagari, Raghavendra Reddy / Nayak, Shekhar / Murty, K. Sri Rama (2014): "Unsupervised spoken word retrieval using Gaussian-bernoulli restricted boltzmann machines", In INTERSPEECH-2014, 1737-1741.