ISCA Archive SCST 1990

Experiments with voice modelling in speech synthesis

Rolf Carlson, Björn Granström, Inger Karlsson

The need for voice variations is apparent in different speech synthesis applications, such as voice prosthesis and translating telephony. Speaking style variations are an important means of discriminating between different kinds of information (Bladon et al., 1987). Data on speaker variability are now being accumulated: Fant et al. (1990) have investigated speaker variations in the context of a multi-talker speech database. Voice source dynamics have been analysed in more detail by Gobl (1988) and Karlsson (1988).

In this presentation we want to describe some recent experiments with voice modelling. We have used speech synthesis as a research vehicle to study both global effects, voice transformations and more individualized transforms implemented by changes in definitions and rules in the text-to-speech system. The present research versions of the KTH text-to-speech system and the possibility for interactive manipulations at the parameter level with on-screen reference to natural speech constitute a flexible environment for such experiments. Special effort is invested in the creation of a female voice. Transformations of male parameters by global rules are not judged to be sufficient. Changes in definitions and rules are therefore made according to data from a natural female voice.

We have recently implemented a more realistic voice source, an expansion of the LF model. The spectral properties of this new voice source and the possibility of dynamic variation have proved to be essential for modelling a female voice.

Our approach has been to make a stylization of individual utterances spoken by speakers with different voice characteristics. In this process we have started with rule-generated parameters and adjusted the target values according to the different voices, using the possibility to overlay a spectrogram of natural speech on the speech synthesis parameter traces. Results from inverse filtering have been used in setting the appropriate voice source parameters. Typically one or two specifications per speech sound have been used for each vocal tract and source parameter. In contrast to the approach taken by Pinto et al. (1989), we are thus not trying to model the human speaker frame by frame, but rather to make a stylization that can later be generalized when formulating rules for the different speakers.
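The difference between a sparse stylization and a frame-by-frame description can be illustrated with a minimal sketch. It is not the KTH implementation: the function name, the millisecond time axis, the example F1 values, and the choice of linear interpolation between targets are all assumptions made for illustration, since the abstract does not say how parameter values between targets are derived.

```python
from bisect import bisect_right


def stylized_track(targets, frame_times):
    """Expand sparse (time, value) targets into one value per frame.

    Assumption: values between two targets are linearly interpolated,
    and values outside the target range are held constant.
    `targets` must be sorted by time.
    """
    times = [t for t, _ in targets]
    values = [v for _, v in targets]
    track = []
    for ft in frame_times:
        i = bisect_right(times, ft)  # index of first target later than ft
        if i == 0:
            track.append(values[0])       # before the first target
        elif i == len(times):
            track.append(values[-1])      # at or after the last target
        else:
            t0, t1 = times[i - 1], times[i]
            v0, v1 = values[i - 1], values[i]
            track.append(v0 + (v1 - v0) * (ft - t0) / (t1 - t0))
    return track


# Hypothetical first-formant targets: (time in ms, frequency in Hz),
# roughly one or two per speech sound.
f1_targets = [(0, 300.0), (100, 700.0), (200, 500.0)]

# Frame times every 50 ms; a real synthesizer would use a finer frame rate.
frames = list(range(0, 201, 50))

f1_track = stylized_track(f1_targets, frames)
print(f1_track)
```

Here three targets describe the whole 200 ms stretch, whereas a frame-by-frame model would store every frame value separately. The sparse form is what makes it practical to adjust targets by hand against a spectrogram overlay and later to express them as rules.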

In our paper we will describe an extended synthesizer, GLOVE, which, compared to the standard OVE III (Liljencrants, 1968), includes several new features such as a modified LF source model (Figure 1). Then we will describe the software environment that has been created for this kind of synthesis work. Finally we will illustrate the approximation method by presenting two synthesized versions, one from each of the two synthesizers, together with a natural female utterance of the same sentence. Some of the remaining modelling problems will be discussed.

Cite as: Carlson, R., Granström, B., Karlsson, I. (1990) Experiments with voice modelling in speech synthesis. Proc. ESCA Workshop on Speaker Characterization in Speech Technology, 28-39

  author={Rolf Carlson and Björn Granström and Inger Karlsson},
  title={{Experiments with voice modelling in speech synthesis}},
  booktitle={Proc. ESCA Workshop on Speaker Characterization in Speech Technology},