Sixth International Conference on Spoken Language Processing
An intonation modeling scheme for Japanese text-to-speech synthesis is proposed, using a command-response F0 model and a neural network to generate F0 contours of accentual phrases uttered in continuous speech. The neural network is used to predict the values of F0 model parameters for a whole sentence, focusing on accentual phrases. The features used as inputs to the neural network are: position of the accentual phrase within the sentence, number of morae in the accentual phrase, accent type of the accentual phrase, number of words in the accentual phrase, and parts of speech of the first and last words of the accentual phrase. The predicted parameters are: a flag that indicates the presence of a phrase command at the beginning of the accentual phrase, magnitude of the phrase command (if present), amplitude of the accent command, and offset values for the timing of phrase and accent commands. All parameters are predicted simultaneously. Three types of neural network structure are used: MLP (multi-layer perceptron), Elman, and Jordan, each with three different numbers of elements in the single hidden layer. The method permits efficient prediction of F0 model parameters, as observed in evaluation experiments and informal listening tests.
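To illustrate what the predicted parameters control, the following is a minimal sketch of a command-response (Fujisaki-type) F0 model in its commonly cited form, in which phrase and accent commands are added to a baseline in the log-F0 domain. The abstract does not give the equations or constants, so the function names, the constants `ALPHA`, `BETA`, and `GAMMA`, and all command values below are illustrative assumptions, not values from the paper.

```python
import math

# Assumed constants (typical textbook values, not taken from the paper).
ALPHA = 3.0   # natural angular frequency of the phrase control mechanism (1/s)
BETA = 20.0   # natural angular frequency of the accent control mechanism (1/s)
GAMMA = 0.9   # ceiling of the accent component


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism (zero before onset)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0


def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent control mechanism, clipped at gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, base_f0, phrase_cmds, accent_cmds):
    """Generate F0 values (Hz) at the given times (s).

    phrase_cmds: list of (onset_time, magnitude)
    accent_cmds: list of (onset_time, offset_time, amplitude)
    """
    contour = []
    for t in times:
        log_f0 = math.log(base_f0)
        for onset, magnitude in phrase_cmds:
            log_f0 += magnitude * phrase_response(t - onset)
        for onset, offset, amplitude in accent_cmds:
            log_f0 += amplitude * (
                accent_response(t - onset) - accent_response(t - offset)
            )
        contour.append(math.exp(log_f0))
    return contour


# Hypothetical example: one phrase command and one accent command.
times = [i * 0.01 for i in range(200)]          # 2 s sampled every 10 ms
f0 = f0_contour(
    times,
    base_f0=120.0,                              # assumed baseline frequency (Hz)
    phrase_cmds=[(0.0, 0.5)],
    accent_cmds=[(0.3, 0.8, 0.4)],
)
```

In this sketch, the phrase command magnitude, the accent command amplitude, and the command timings are the quantities that a predictor such as the neural network described above would supply for each accentual phrase.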
Bibliographic reference. Sakurai, Atsuhiro / Minematsu, Nobuaki / Hirose, Keikichi (2000): "Data-driven intonation modeling using a neural network and a command response model", In ICSLP-2000, vol.3, 223-226.