In this paper we describe two unsupervised representations of prosodic sequences, based on k-means and Dirichlet Process Gaussian Mixture Model (DPGMM) clustering. The clustering algorithms are used to infer an inventory of prosodic categories over automatically segmented syllables. A trigram model is trained over the resulting category sequences to characterize speech. We find that DPGMM clusters show a greater correspondence with manual ToBI labels than k-means clusters. However, sequence models trained on k-means clusters significantly outperform those trained on DPGMM clusters in classifying speaking style, nativeness, and speakers. We also investigate the use of these sequence models for detecting outliers in these three tasks. Non-parametric Bayesian techniques have the advantage of learning a clustering solution and inferring the number of clusters directly from data. While it is attractive to avoid specifying k before clustering, we find that for characterizing prosodic sequences, effective use of DPGMMs still requires a significant amount of parameter tuning, and performance fails to reach the level of k-means.
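The sequence-modeling step described above can be illustrated with a minimal sketch. It assumes each syllable has already been assigned a cluster ID (by k-means or a DPGMM), trains one add-one smoothed trigram model per class, and labels a new sequence with the class whose model scores it highest. The class names, toy sequences, smoothing scheme, and the `TrigramModel` and `classify` helpers are all hypothetical choices for illustration, not details taken from the paper.

```python
import math
from collections import Counter


class TrigramModel:
    """Add-one smoothed trigram model over sequences of cluster IDs.

    Illustrative only: the smoothing scheme is an assumption, not the
    one used in the paper.
    """

    def __init__(self, vocab_size):
        # Number of symbols that can be predicted: clusters + end marker.
        self.vocab_size = vocab_size
        self.trigrams = Counter()
        self.contexts = Counter()

    @staticmethod
    def _pad(seq):
        # Two start markers give the first symbol a full trigram context.
        return ["<s>", "<s>"] + list(seq) + ["</s>"]

    def fit(self, sequences):
        for seq in sequences:
            padded = self._pad(seq)
            for i in range(2, len(padded)):
                context = (padded[i - 2], padded[i - 1])
                self.trigrams[context + (padded[i],)] += 1
                self.contexts[context] += 1
        return self

    def log_prob(self, seq):
        """Average per-symbol log-probability of a sequence."""
        padded = self._pad(seq)
        total = 0.0
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            numerator = self.trigrams[context + (padded[i],)] + 1
            denominator = self.contexts[context] + self.vocab_size
            total += math.log(numerator / denominator)
        return total / (len(padded) - 2)


def classify(seq, models):
    """Return the class label whose trigram model scores seq highest."""
    return max(models, key=lambda label: models[label].log_prob(seq))


# Hypothetical toy data: cluster IDs 0..2 for two speaking styles.
read_speech = [[0, 1, 2, 0, 1, 2], [0, 1, 2, 0, 1, 2, 0, 1]]
spontaneous = [[2, 2, 1, 0, 0, 2], [2, 1, 0, 0, 2, 2, 1]]

vocab_size = 3 + 1  # three clusters plus the end-of-sequence marker
models = {
    "read": TrigramModel(vocab_size).fit(read_speech),
    "spontaneous": TrigramModel(vocab_size).fit(spontaneous),
}

predicted = classify([0, 1, 2, 0, 1], models)
```

Under the same assumptions, a sequence whose average log-probability is low under every class model could be flagged as a candidate outlier, although the threshold would need to be chosen empirically.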
Bibliographic reference. Rosenberg, Andrew (2013): "Modeling prosodic sequences with k-means and Dirichlet process GMMs", in INTERSPEECH-2013, 520-524.