In practical scenarios for speaker adaptation of speech synthesis systems, the quality of the adaptation audio data may be poor. In these situations, it is necessary to use the available audio to capture the speaker attributes, whilst aiming to obtain a synthesis voice that has none of the low-quality attributes of the audio. One approach is to define a sub-space of parametric synthesis parameters in which the adapted system must lie. Though this yields reasonable synthesis quality, similarity to the target speaker degrades, and quality also suffers in severe noise conditions. This paper describes a smoothing approach that addresses this problem. For a noisy target speaker, a 'similar speaker' is first selected from a database of speakers. Statistics from this speaker are then smoothed with those obtained from the target speaker. By appropriately combining the two sources of information, it is possible to balance similarity and quality. Results indicate that both quality and similarity can be improved by smoothing, especially in severe noise conditions. Similarity performance, however, varies from speaker to speaker, indicating the importance of a reasonable automatic speaker selection method and of the coverage of the candidate speaker pool.
Bibliographic reference. Yanagisawa, Kayoko / Chen, Langzhou / Gales, Mark J. F. (2014): "Noise-robust TTS speaker adaptation with statistics smoothing", In INTERSPEECH-2014, 1519-1523.