We propose a constrained shift- and scale-invariant sparse coding model for the unsupervised segmentation and clustering of speech into acoustically relevant sub-word units for automatic speech recognition (ASR). We introduce a novel local search algorithm that, starting from a random initialization, iteratively improves the acoustic relevance of the automatically determined sub-word units by repeated alignment with the training material and subsequent re-estimation. We also contribute an associated population-based metaheuristic optimisation procedure, related to genetic approaches, that performs a global search for the most acoustically relevant set of sub-word units. A first application of this metaheuristic search indicates that it yields an improvement over a corresponding local search. Using a subset of TIMIT for training, we also find that some of the automatically determined sub-word units in our final dictionaries exhibit a strong correlation with the reference phonetic transcriptions. Furthermore, in some cases our sub-word transcriptions yield a compact set of often-used pronunciations. Informal listening tests indicate that the clustering is robust, which gives us optimism that our approach will be suited to generating pronunciation dictionaries that can be used for ASR.
Bibliographic reference. Agenbag, Wiehan / Niesler, Thomas (2015): "Automatic segmentation and clustering of speech using sparse coding and metaheuristic search", In INTERSPEECH-2015, 3184-3188.