Fifth ISCA ITRW on Speech Synthesis
June 14-16, 2004
Corpus-based technologies, e.g., unit selection and concatenative synthesis, have dramatically improved the naturalness of synthetic speech. These approaches make it possible to use Text-to-Speech (TTS) more widely; however, they are still not well suited to flexibly synthesizing various types of speech. Voice conversion is a promising technique for flexible synthesis. It enables us to modify speech using conversion rules that are statistically extracted from only a small amount of speech data. Speaker conversion is the best-known application of voice conversion, but the technique can also be applied elsewhere, e.g., to speaking style conversion.
This tutorial will provide an overview of voice conversion, focusing on statistical spectral conversion. Following an outline of a general framework for spectral conversion, we will review some conventional conversion methods. We will then describe in detail the most popular of them, the conversion algorithm based on a Gaussian Mixture Model (GMM) proposed by Stylianou. Although the GMM-based method converts spectra more appropriately than other methods, e.g., those based on Vector Quantization and Linear Multivariate Regression, several problems still degrade the quality of the converted speech. In the tutorial, we will discuss the following problems: 1) the conversion function is not supported by a proper statistical model, 2) the frame-based conversion causes spectral discontinuities, and 3) the statistical modeling excessively smooths the converted spectra. Some techniques for addressing these problems will be presented.
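As a rough illustration of what a GMM-based conversion function looks like, the sketch below converts a single source frame using the joint-density form often seen in the voice conversion literature: a posterior-weighted sum of per-component linear regressions from source to target features. It is not taken from the tutorial itself. The function and parameter names (`convert_frame`, `mu_x`, `mu_y`, `cov_xx`, `cov_yx`) are hypothetical, and the GMM parameters are assumed to have been trained beforehand on time-aligned source and target frames.

```python
import numpy as np


def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian at a single vector x."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


def convert_frame(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Convert one source feature vector x to the target feature space.

    Assumed (hypothetical) parameter layout for an M-component GMM:
      weights: (M,)        mixture weights
      mu_x:    (M, Dx)     source means
      mu_y:    (M, Dy)     target means
      cov_xx:  (M, Dx, Dx) source covariances
      cov_yx:  (M, Dy, Dx) target-source cross-covariances
    """
    n_mix = len(weights)

    # Posterior probability of each mixture component given the source frame.
    log_post = np.array([
        np.log(weights[m]) + gaussian_logpdf(x, mu_x[m], cov_xx[m])
        for m in range(n_mix)
    ])
    log_post -= log_post.max()  # subtract the max for numerical stability
    post = np.exp(log_post)
    post /= post.sum()

    # Posterior-weighted sum of the per-component linear regressions.
    y = np.zeros(mu_y.shape[1])
    for m in range(n_mix):
        regression = cov_yx[m] @ np.linalg.solve(cov_xx[m], x - mu_x[m])
        y += post[m] * (mu_y[m] + regression)
    return y
```

Each frame is converted independently of its neighbours, which is the frame-based behaviour behind problem 2), and the weighted averaging over components is one plausible source of the over-smoothing named in problem 3).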
Finally, some example applications of voice conversion will be introduced, and we will discuss the problems that remain to be solved before voice conversion can be used in practical situations. The tutorial will also provide information about a voice conversion package that will be released through FestVox this summer.
Bibliographic reference. Toda, Tomoki (2004): "Overview of voice conversion", In SSW5-2004 (abstract).