In statistical speech synthesis, the quality of synthesized speech depends on the quality of the training data. Because the sampling rate is one of the factors that affects this quality, speech data has recently been recorded at high sampling rates. However, speech data recorded in the past or collected from the Internet often has a low sampling rate. To use such data effectively for model training, we propose a mel-cepstral analysis technique that statistically restores the missing high-frequency components of low-sampling-rate speech. In this technique, high-sampling-rate speech waveforms are modeled directly by integrating the feature extraction and modeling processes, which makes it possible to optimize the whole process on the basis of a single integrated objective function. Mel-cepstral coefficients are then estimated from the low-sampling-rate speech by using this model as a prior distribution. Experimental results show that the proposed method improves the quality of synthesized speech.
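The idea of using a model as a prior to recover components that the observation does not contain can be illustrated with a toy example. The sketch below is not the method of the paper: it assumes a single "low-band" feature and a single "high-band" feature with a joint Gaussian prior, observes only the low-band feature with Gaussian noise, and returns the posterior mean (which is also the MAP estimate in this Gaussian case). The function name `map_restore`, the two-dimensional setup, and all numeric values are hypothetical and chosen only for illustration.

```python
def map_restore(y_low, prior_mean, prior_cov, obs_var):
    """Toy MAP estimate of (low, high) features from a low-band observation.

    Assumptions (illustrative only, not taken from the paper):
      - the prior over (low, high) is a 2-D Gaussian,
      - only the low-band feature is observed, with noise variance obs_var.
    """
    mu_l, mu_h = prior_mean
    (s_ll, s_lh), (_, _s_hh) = prior_cov

    # Standard Gaussian posterior-mean update with observation matrix [1, 0]:
    # gain = Sigma * H^T / (H * Sigma * H^T + obs_var)
    denom = s_ll + obs_var
    gain_l = s_ll / denom
    gain_h = s_lh / denom

    resid = y_low - mu_l
    est_low = mu_l + gain_l * resid
    # The high-band feature is never observed; it is inferred only through
    # its prior correlation with the low-band feature.
    est_high = mu_h + gain_h * resid
    return est_low, est_high


if __name__ == "__main__":
    prior_mean = (0.0, 0.0)
    prior_cov = ((1.0, 0.8), (0.8, 1.0))
    print(map_restore(1.0, prior_mean, prior_cov, obs_var=1.0))
```

In this toy setting, a stronger prior correlation between the two features moves the restored high-band value further from its prior mean, while zero correlation leaves it at the prior mean regardless of the observation.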
Bibliographic reference. Nakamura, Kazuhiro / Hashimoto, Kei / Oura, Keiichiro / Nankaku, Yoshihiko / Tokuda, Keiichi (2014): "A mel-cepstral analysis technique restoring high frequency components from low-sampling-rate speech", In INTERSPEECH-2014, 2494-2498.