15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Noise-Robust TTS Speaker Adaptation with Statistics Smoothing

Kayoko Yanagisawa, Langzhou Chen, Mark J. F. Gales

Toshiba Research Europe, UK

In practical scenarios for speaker adaptation of speech synthesis systems, the quality of adaptation audio data may be poor. In these situations, it is necessary to make use of the available audio to capture the speaker attributes, whilst aiming to obtain a synthesis voice which does not have any of the low-quality attributes of the audio. One approach to achieving this is to define a sub-space of parametric synthesis parameters in which the adapted system must lie. Though this yields reasonable synthesis quality, target speaker similarity degrades. Quality is also affected in severe noise conditions. This paper describes a smoothing approach that addresses this problem. For a noisy target speaker, first a 'similar speaker' is selected from a database of speakers. Statistics from this speaker are then smoothed with those obtained from the target speaker. By appropriately combining the two sources of information, it is possible to balance similarity and quality. Results indicate that both the quality and similarity can be improved by smoothing, especially for severe noise conditions. The similarity performance, however, varies from speaker to speaker, indicating the importance of a reasonable automatic speaker selection method and the coverage of the candidate speaker pool.
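
The two steps named in the abstract (selecting a similar speaker, then smoothing statistics) can be illustrated with a minimal Python sketch. The abstract does not specify the selection criterion, the form of the statistics, or the smoothing rule, so everything below is an assumption made for illustration: speakers are compared by Euclidean distance between mean feature vectors, the statistics are an occupancy count and a first-order sum, and smoothing is a weighted addition controlled by a hypothetical weight `tau`. It is not the authors' implementation.

```python
import math


def select_similar_speaker(target_mean, candidate_means):
    """Return the name of the candidate closest to the target.

    Assumption: similarity is Euclidean distance between mean vectors;
    the abstract does not state the actual selection criterion.
    """
    return min(
        candidate_means,
        key=lambda name: math.dist(target_mean, candidate_means[name]),
    )


def smooth_statistics(target_stats, similar_stats, tau):
    """Add the similar speaker's statistics, scaled by tau, to the target's.

    Assumption: statistics are an occupancy count ("occ") and a
    first-order sum ("first"); tau = 0 keeps the target statistics only.
    """
    occ = target_stats["occ"] + tau * similar_stats["occ"]
    first = [
        t + tau * s
        for t, s in zip(target_stats["first"], similar_stats["first"])
    ]
    return {"occ": occ, "first": first}


def mean_from_stats(stats):
    """Estimate a mean vector from accumulated statistics."""
    return [x / stats["occ"] for x in stats["first"]]


# Hypothetical toy data: a noisy target and two clean candidate speakers.
target_stats = {"occ": 10.0, "first": [20.0, 0.0]}
candidate_stats = {
    "speaker_a": {"occ": 20.0, "first": [50.0, 10.0]},
    "speaker_b": {"occ": 20.0, "first": [-60.0, 80.0]},
}

candidate_means = {
    name: mean_from_stats(stats) for name, stats in candidate_stats.items()
}
chosen = select_similar_speaker(mean_from_stats(target_stats), candidate_means)
smoothed = smooth_statistics(target_stats, candidate_stats[chosen], tau=0.5)
smoothed_mean = mean_from_stats(smoothed)
```

In this sketch a larger `tau` pulls the estimate towards the clean similar speaker (favouring quality), while a smaller `tau` stays closer to the noisy target (favouring similarity), which mirrors the trade-off the abstract describes.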


Bibliographic reference. Yanagisawa, Kayoko / Chen, Langzhou / Gales, Mark J. F. (2014): "Noise-robust TTS speaker adaptation with statistics smoothing", In INTERSPEECH-2014, 1519-1523.