Spoken term detection is important for retrieval of multimedia and spoken content over the Internet. Because it is difficult to obtain acoustic/language models well matched to the huge quantities of spoken documents produced under varied conditions, unsupervised approaches using frame-based dynamic time warping (DTW) have been proposed to compare a spoken query with spoken documents frame by frame. In this paper, we propose a new approach to unsupervised spoken term detection using segment-based DTW. Speech signals are segmented into sequences of acoustically similar segments using hierarchical agglomerative clustering, and a DTW procedure is formulated over these segment sequences together with the clustering tree structures. In this way, the number of highly redundant parameters can be reduced, and the relatively unstable frame-level feature vectors can be replaced by more stable segments that describe the sequence of vocal tract states during the utterance. Preliminary experiments show a large reduction in computation time compared to frame-based DTW, although the slightly degraded detection performance leaves much room for further improvement.
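The frame-based DTW baseline that the proposed segment-based method is compared against can be sketched as follows. This is an illustrative implementation, not the authors' code: the function name, the use of Euclidean frame distance, and the standard three-move recursion are assumptions; the paper's segment-based variant would instead warp over cluster-derived segments.

```python
import numpy as np

def dtw_distance(query, doc):
    """Frame-by-frame DTW between a spoken query and a spoken document,
    both given as (n_frames, n_dims) acoustic feature matrices (e.g. MFCCs)."""
    n, m = len(query), len(doc)
    # Pairwise Euclidean distances between every query frame and document frame.
    cost = np.linalg.norm(query[:, None, :] - doc[None, :, :], axis=2)
    # Accumulated-cost matrix with an extra row/column for the boundary condition.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # query frame repeated (insertion)
                acc[i, j - 1],      # document frame skipped (deletion)
                acc[i - 1, j - 1],  # one-to-one frame match
            )
    return acc[n, m]
```

The O(n·m) cost of this recursion over every frame pair is precisely what motivates warping over segments instead: collapsing runs of acoustically similar frames into single segments shrinks both sequence lengths before the DTW table is filled.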
Bibliographic reference. Chan, Chun-an / Lee, Lin-shan (2010): "Unsupervised spoken-term detection with spoken queries using segment-based dynamic time warping", In INTERSPEECH-2010, 693-696.