15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Generating Multiple-Accent Pronunciations for TTS Using Joint Sequence Model Interpolation

BalaKrishna Kolluru, Vincent Wan, Javier Latorre, Kayoko Yanagisawa, Mark J. F. Gales

Toshiba Research Europe, UK

Standard grapheme-to-phoneme (G2P) systems are trained using a homogeneous lexicon, for example one associated with a particular accent. In practice, a synthesis system may be required to handle multiple accents. Furthermore, a speaker rarely has a pure accent; accents vary continuously within and between regions of a country. Generating a phonetic sequence for each accent is possible, but combining these sequences to yield a single synthesis pronunciation is highly challenging. To address this problem, this paper considers a space of accents. The basis of this space is defined by statistical G2P models in the form of graphone models, and a linear combination of these models defines the accent space. By selecting a point in this continuous space, it is possible to specify the accent for an individual speaker. The approach is evaluated using an accent space defined by American, Scottish and British English. By moving around the accent space, it is shown that speech can be synthesized in all of these accents as well as at a range of intermediate points.
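The idea of an accent space spanned by graphone models can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact formulation, so the sketch assumes plain linear interpolation of per-accent graphone probabilities with non-negative weights that sum to one. The function name `interpolate`, the toy graphones, the phoneme symbols and all probability values are hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple

# A graphone pairs a grapheme chunk with a phoneme chunk.
Graphone = Tuple[str, str]
# Assumption: each accent model is reduced to a unigram distribution over graphones.
Model = Dict[Graphone, float]


def interpolate(models: Dict[str, Model], weights: Dict[str, float]) -> Model:
    """Linearly combine per-accent graphone distributions.

    `weights` is a point in the accent space: one non-negative weight per
    accent, summing to one (an assumption of this sketch).
    """
    if set(models) != set(weights):
        raise ValueError("models and weights must cover the same accents")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")

    combined: Model = {}
    for accent, model in models.items():
        for graphone, prob in model.items():
            combined[graphone] = combined.get(graphone, 0.0) + weights[accent] * prob
    return combined


# Toy, made-up distributions for the letter "a" in three accents.
models = {
    "american": {("a", "ae"): 0.9, ("a", "aa"): 0.1},
    "british": {("a", "ae"): 0.2, ("a", "aa"): 0.8},
    "scottish": {("a", "a"): 1.0},
}

# A point halfway between the American and British models.
point = {"american": 0.5, "british": 0.5, "scottish": 0.0}
mixed = interpolate(models, point)

# Pick the most probable graphone at this point in the accent space.
best = max(mixed, key=mixed.get)
```

Moving `point` toward a single accent recovers that accent's model, while intermediate points blend the pronunciation preferences of several accents.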


Bibliographic reference. Kolluru, BalaKrishna / Wan, Vincent / Latorre, Javier / Yanagisawa, Kayoko / Gales, Mark J. F. (2014): "Generating multiple-accent pronunciations for TTS using joint sequence model interpolation", in INTERSPEECH-2014, 1273-1277.