SLTU-2008 - First International Workshop on Spoken Languages Technologies for Under-Resourced Languages

Hanoi, Vietnam
May 5-7, 2008

Synthesizer Voice Quality of New Languages Calibrated with Mean Mel Cepstral Distortion

John Kominek, Tanja Schultz, Alan W. Black

Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA

When developing synthesizers for new languages, one must select a phoneset, record phonetically balanced sentences, build up a pronunciation lexicon, and evaluate the results. An objective measure of voice quality can be very useful, provided it is calibrated across multiple speakers, languages, and databases. As a substitute for full listening tests, this paper adopts mel-cepstral distortion (MCD) as a measure of spectral accuracy, and proposes systematic variation of a known English corpus as a method of calibration. We find that doubling the database size reduces MCD by 0.12, while reverting to a grapheme-based voice increases it by 0.27. This offers a frame of reference for estimating voice quality, which is applied to a test suite of 8 non-English languages.
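
The abstract does not give the formula, so the following is only a minimal sketch of one commonly used definition of mean mel-cepstral distortion, not necessarily the exact variant used in the paper. It assumes that the reference and synthesized utterances have already been time-aligned frame by frame, that each frame is a list of mel-cepstral coefficients, and that the energy term c0 is excluded. The function names frame_mcd and mean_mcd are hypothetical.

```python
import math


def frame_mcd(ref, syn):
    """Mel-cepstral distortion (in dB) between two aligned frames.

    Assumption: index 0 holds the energy coefficient c0, which is skipped.
    """
    if len(ref) != len(syn):
        raise ValueError("frames must have the same number of coefficients")
    # Sum of squared differences over coefficients 1..D (c0 excluded).
    squared_diff = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
    return (10.0 / math.log(10.0)) * math.sqrt(2.0 * squared_diff)


def mean_mcd(ref_frames, syn_frames):
    """Average the per-frame distortion over a sequence of aligned frames."""
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("need two non-empty, equally long frame sequences")
    total = sum(frame_mcd(r, s) for r, s in zip(ref_frames, syn_frames))
    return total / len(ref_frames)
```

Under this definition, lower values mean the synthesized spectrum is closer to the reference, which is why the reported 0.12 reduction from doubling the database is an improvement and the 0.27 increase for a grapheme-based voice is a degradation.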


Bibliographic reference. Kominek, John / Schultz, Tanja / Black, Alan W. (2008): "Synthesizer voice quality of new languages calibrated with mean mel cepstral distortion", in SLTU-2008, 63-68.