Unsupervised phoneme segmentation aims to divide a speech stream into phonemes without using any prior knowledge of linguistic content or acoustic models. In previous work, we formulated this problem within an optimization framework and developed an objective function, the summation of squared error (SSE), based on the Euclidean distance on cepstral features. However, it is unknown whether Euclidean distance is the best metric for estimating the goodness of a segmentation. In this paper, we study how to learn a good metric to improve segmentation performance. We propose two criteria for metric learning: Minimum of Summation Variance (MSV) and Maximum of Discrimination Variance (MDV). Experimental results on the TIMIT database indicate that a learned metric achieves better segmentation performance. The best recall rate in this paper is 81.8% (20 ms window), compared to 77.5% in our previous work. We also introduce an iterative algorithm that learns the metric without labeled data and achieves results similar to those obtained with labeled data.
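As a rough illustration of the kind of objective the abstract describes, the sketch below scores a candidate segmentation by summing, over all segments, the squared distances of each frame to its segment mean. This is an assumption about the general shape of an SSE criterion, not the paper's exact formulation. The function name `segmentation_sse` and the `transform` argument are hypothetical: `transform` is a linear map A standing in for a learned metric, so that distances become ||A(x - y)||^2, and the MSV and MDV criteria used to learn it are not reproduced here.

```python
import numpy as np


def segmentation_sse(features, boundaries, transform=None):
    """Sum of squared distances from each frame to its segment mean.

    features:   (n_frames, n_dims) array, e.g. cepstral vectors.
    boundaries: strictly increasing interior frame indices at which
                a new segment starts (0 and n_frames are implied).
    transform:  optional (k, n_dims) matrix A (hypothetical stand-in
                for a learned metric); distances become ||A(x - y)||^2.
                With transform=None the plain Euclidean distance is used.
    """
    x = np.asarray(features, dtype=float)
    if transform is not None:
        # Applying A to every frame is equivalent to measuring
        # distances under the metric A^T A in the original space.
        x = x @ np.asarray(transform, dtype=float).T

    edges = [0, *boundaries, len(x)]
    total = 0.0
    for start, end in zip(edges[:-1], edges[1:]):
        segment = x[start:end]
        total += np.sum((segment - segment.mean(axis=0)) ** 2)
    return float(total)


# Toy data: two perfectly flat "phonemes" of three frames each.
frames = np.array([[0.0, 0.0]] * 3 + [[1.0, 4.0]] * 3)

sse_true = segmentation_sse(frames, [3])      # boundary at the true change point
sse_shifted = segmentation_sse(frames, [2])   # boundary one frame too early

# A diagonal transform that down-weights the second feature dimension.
scale = np.diag([1.0, 0.5])
sse_shifted_scaled = segmentation_sse(frames, [2], transform=scale)
```

On this toy input the boundary at the true change point gives a lower score than the shifted one, which is the property an optimization over boundaries would exploit. Changing `transform` changes how strongly each feature dimension contributes to the score, which is the degree of freedom that metric learning adjusts.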
Bibliographic reference. Qiao, Yu / Minematsu, Nobuaki (2008): "Metric learning for unsupervised phoneme segmentation", In INTERSPEECH-2008, 1060-1063.