We consider the problem of unsupervised acoustic unit mining from unlabeled speech data. One typical method involves two steps: unsupervised segmentation and segment clustering. This paper proposes to improve segment clustering with segment-level Gaussian posteriorgram representation, which is generated by averaging the frame-level Gaussian posterior probabilities within each segment. Stacking together the segment-level Gaussian posteriorgrams of all the speech data, a Gaussian-by-segment data matrix is constructed. Given the Gaussian-by-segment matrix, we have the flexibility to cluster either the Gaussian components or the segments into different acoustic unit categories. We have investigated both normalized cut and non-negative matrix factorization approaches on the data matrix for the clustering purpose. We carried out experiments to measure the quality of the clustering results with reference to manual phoneme labels. Experimental results show that the proposed methods consistently outperform a traditional vector quantization method and a Gaussian mixture model labeling method.
Bibliographic reference. Wang, Haipeng / Lee, Tan / Leung, Cheung-Chi / Ma, Bin / Li, Haizhou (2013): "Unsupervised mining of acoustic subword units with segment-level Gaussian posteriorgrams", In INTERSPEECH-2013, 2297-2301.