7th International Conference on Spoken Language Processing

September 16-20, 2002
Denver, Colorado, USA

A Context Clustering Technique for Average Voice Model in HMM-Based Speech Synthesis

Junichi Yamagishi (1), Masatsune Tamura (1), Takashi Masuko (1), Keiichi Tokuda (2), Takao Kobayashi (1)

(1) Tokyo Institute of Technology, Japan; (2) Nagoya Institute of Technology, Japan

This paper describes a new technique for constructing the decision tree used to cluster an average voice model, i.e., a set of speaker-independent speech units. We first train speaker-dependent models on a multi-speaker speech database and then construct a single speaker-independent decision tree for context clustering that is common to these speaker-dependent models. When a node of the decision tree is split, only context-related questions that can split the node for all speaker-dependent models are adopted. Consequently, every node of the decision tree contains training data from all speakers. Results of a paired comparison test show that the average voice model trained with the proposed technique synthesizes more natural-sounding speech than the conventional average voice model.
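The node-splitting rule can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the splitting criterion, so a simple reduction in the sum of squared errors of a scalar feature stands in for it, and the sample layout, the question names, and the function `best_shared_question` are all hypothetical. The only part taken from the abstract is the constraint that a question is adopted only if both child nodes keep data from every speaker.

```python
def sse(values):
    # Sum of squared deviations from the mean (0.0 for an empty list).
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_shared_question(samples, questions, require_all_speakers=True):
    """Pick the question with the largest SSE reduction for one node.

    samples   : list of (context_dict, speaker_id, scalar_value) tuples
    questions : dict mapping a question name to a predicate on the context
    With require_all_speakers=True, a question is skipped unless both child
    nodes contain data from every speaker present in the parent node.
    """
    speakers = {spk for _, spk, _ in samples}
    parent_cost = sse([v for _, _, v in samples])
    best_name, best_gain = None, 0.0
    for name, question in questions.items():
        yes = [s for s in samples if question(s[0])]
        no = [s for s in samples if not question(s[0])]
        if require_all_speakers:
            yes_speakers = {spk for _, spk, _ in yes}
            no_speakers = {spk for _, spk, _ in no}
            if yes_speakers != speakers or no_speakers != speakers:
                continue  # some speaker would lose all data in a child node
        gain = parent_cost - sse([v for _, _, v in yes]) - sse([v for _, _, v in no])
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name


# Toy data (invented): speaker B has no accented samples.
samples = [
    ({"vowel": True, "accented": True}, "A", 10.0),
    ({"vowel": False, "accented": True}, "A", 9.0),
    ({"vowel": True, "accented": False}, "A", 2.0),
    ({"vowel": False, "accented": False}, "A", 1.0),
    ({"vowel": True, "accented": False}, "B", 2.0),
    ({"vowel": False, "accented": False}, "B", 1.0),
]
questions = {
    "is_vowel": lambda ctx: ctx["vowel"],
    "is_accented": lambda ctx: ctx["accented"],
}

shared = best_shared_question(samples, questions)
unconstrained = best_shared_question(samples, questions, require_all_speakers=False)
```

In this toy data the accent question gives the largest error reduction, but it would leave one child node with no data from speaker B, so the constrained search falls back to the vowel question, which splits the data of both speakers.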


Bibliographic reference. Yamagishi, Junichi / Tamura, Masatsune / Masuko, Takashi / Tokuda, Keiichi / Kobayashi, Takao (2002): "A context clustering technique for average voice model in HMM-based speech synthesis", in ICSLP-2002, 133-136.