This paper describes a method for speaker clustering, applied to building average voice models for speaker-adaptive HMM-based speech synthesis that provide a good basis for adaptation to specific target speakers. Our main hypothesis is that using perceptually similar speakers to build the average voice model will be better than using unselected speakers, even if the amount of data available from perceptually similar speakers is smaller. We measure the perceived similarities among a group of 30 female speakers in a listening test and then apply multiple linear regression to predict these listener judgements of speaker similarity, and thus to identify similar speakers automatically. We then compare average voice models trained on speakers who were perceptually judged to be similar to the target speaker, on speakers selected by the multiple linear regression, or on a large global set of unselected speakers. We find that the average voice model trained on perceptually similar speakers provides better performance than the global model, even though the latter is trained on more data, confirming our main hypothesis. However, the average voice model using speakers selected automatically by the multiple linear regression does not reach the same level of performance.
Index Terms: Statistical parametric speech synthesis, hidden Markov models, speaker adaptation
Bibliographic reference. Dall, Rasmus / Veaux, Christophe / Yamagishi, Junichi / King, Simon (2012): "Analysis of speaker clustering strategies for HMM-based speech synthesis", In INTERSPEECH-2012, 995-998.