15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

A Flexible Front-End for HTS

Matthew P. Aylett (1), Rasmus Dall (2), Arnab Ghoshal (2), Gustav Eje Henter (2), Thomas Merritt (2)

(1) CereProc, UK
(2) University of Edinburgh, UK

Parametric speech synthesis techniques depend on full-context acoustic models generated by language front-ends, which analyse linguistic and phonetic structure. HTS, the leading parametric synthesis system, can use a number of different front-ends to generate full-context models for synthesis and training. In this paper we explore the use of a new text processing front-end that has been added to the speech recognition toolkit Kaldi as part of an ongoing project to produce a new parametric speech synthesis system, Idlak. The use of XML specification files, a modular design, and modern coding and testing approaches makes the Idlak front-end well suited to adding, altering and experimenting with the contexts used in full-context acoustic models. The Idlak front-end was evaluated against the standard Festival front-end in the HTS system. Results from the Idlak front-end compare well with those from the more mature Festival front-end (Idlak: 2.83 MOS vs. Festival: 2.85 MOS), although non-native English speakers perceived a slight reduction in naturalness for Idlak, which can be attributed to Festival's insertion of non-punctuated pauses.


Bibliographic reference. Aylett, Matthew P. / Dall, Rasmus / Ghoshal, Arnab / Henter, Gustav Eje / Merritt, Thomas (2014): "A flexible front-end for HTS", in INTERSPEECH-2014, 1283-1287.