This paper proposes an unsupervised framework for spotting spoken terms in large speech databases. Spoken term detection is performed with a two-stage retrieval mechanism. First, a highly efficient Bag of Acoustic Words (BoAW) index enables quick retrieval of relevant documents; using an N-gram approach, the acoustic dictionary that best describes each document is selected. Once the first stage has sharply reduced the search space, its results are fed to the second stage of the retrieval engine, where a computationally optimised variant of dynamic programming, called Non-Segmental Dynamic Time Warping (NS-DTW), further prunes the results. All experiments are conducted on the MediaEval 2012 dataset. Performance is evaluated at the output of each stage, and the optimum parameters are identified. We show that cascading the two stages reduces the probable search space, which translates into higher search speeds while maintaining comparable performance. The value of the indexing framework is demonstrated by comparison against a random-selection system.
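The two-stage pipeline can be illustrated with a minimal sketch. This is not the paper's implementation: the codebook is assumed to be precomputed, documents are ranked by cosine similarity of acoustic-word histograms (a simplified stand-in for the N-gram BoAW index), and plain path-normalised DTW stands in for NS-DTW in the second stage. All function and variable names here are illustrative.

```python
# Illustrative sketch only; the paper's BoAW indexing and NS-DTW are more
# elaborate. Stage 1 prunes the search space cheaply; stage 2 re-scores
# the shortlist with a frame-level alignment distance.
import numpy as np

def to_acoustic_words(frames, codebook):
    """Map each feature frame to its nearest codebook centroid (acoustic word)."""
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def boaw_histogram(words, vocab_size):
    """L2-normalised bag-of-acoustic-words term-frequency vector."""
    h = np.bincount(words, minlength=vocab_size).astype(float)
    n = np.linalg.norm(h)
    return h / n if n > 0 else h

def dtw_distance(a, b):
    """Classic DTW between two frame sequences (stand-in for NS-DTW)."""
    na, nb = len(a), len(b)
    D = np.full((na + 1, nb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, na + 1):
        for j in range(1, nb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[na, nb] / (na + nb)  # path-length normalised

def two_stage_search(query, docs, codebook, shortlist=2):
    """Stage 1: BoAW cosine ranking; stage 2: DTW re-scoring of the shortlist."""
    v = len(codebook)
    q_hist = boaw_histogram(to_acoustic_words(query, codebook), v)
    sims = [q_hist @ boaw_histogram(to_acoustic_words(d, codebook), v)
            for d in docs]
    cand = np.argsort(sims)[::-1][:shortlist]  # pruned search space
    return min(cand, key=lambda i: dtw_distance(query, docs[i]))
```

The cascade trades a cheap, coarse histogram match for an expensive, accurate alignment: only documents surviving stage 1 ever incur the quadratic DTW cost, which is where the reported speed-up comes from.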
Bibliographic reference. George, Basil / Saxena, Abhijeet / Mantena, Gautam / Prahallad, Kishore / Yegnanarayana, B. (2014): "Unsupervised query-by-example spoken term detection using bag of acoustic words and non-segmental dynamic time warping", in Proc. INTERSPEECH 2014, pp. 1742-1746.