10th Annual Conference of the International Speech Communication Association

Brighton, United Kingdom
September 6-10, 2009

Observation of Empirical Cumulative Distribution of Vowel Spectral Distances and Its Application to Vowel Based Voice Conversion

Hideki Kawahara (1), Masanori Morise (2), Toru Takahashi (3), Hideki Banno (4), Ryuichi Nisimura (1), Toshio Irino (1)

(1) Wakayama University, Japan
(2) Ritsumeikan University, Japan
(3) Kyoto University, Japan
(4) Meijo University, Japan

A simple and fast voice conversion method based only on vowel information is proposed. The method relies on the empirical distribution of perceptual spectral distances between representative examples of each vowel segment, extracted using the TANDEM-STRAIGHT spectral envelope estimation procedure [1]. Mapping functions of vowel spectra are designed to preserve the vowel space structure defined by the observed empirical distribution while transforming the position and orientation of that structure in an abstract vowel spectral space. By introducing physiological constraints on vocal tract shapes together with vocal tract length normalization, the need for careful frequency alignment between the vowel template spectra of the source and target speakers is alleviated without significant degradation of the converted speech. The proposed method is a frame-based instantaneous method and is therefore suited to real-time processing. Applications of the proposed method to cross-language voice conversion are also discussed.
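As a rough illustration of what an empirical cumulative distribution of vowel spectral distances looks like in practice, the sketch below computes pairwise distances between vowel template spectra and their empirical CDF. It is not the authors' procedure: the plain RMS log-spectral distance stands in for the perceptual distance used in the paper, the four-bin templates are made-up toy values rather than TANDEM-STRAIGHT output, and all function names are hypothetical.

```python
import math
from itertools import combinations


def log_spectral_distance(a, b):
    """RMS difference of two log-magnitude spectra (dB) on the same frequency grid.

    Assumption: a simple stand-in for the perceptual distance in the paper.
    """
    if len(a) != len(b):
        raise ValueError("spectra must share the same frequency grid")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def pairwise_distances(templates):
    """Distance for every unordered pair of vowel labels."""
    return {
        (u, v): log_spectral_distance(templates[u], templates[v])
        for u, v in combinations(sorted(templates), 2)
    }


def empirical_cdf(values):
    """Sorted values and, for each, the fraction of samples at or below it."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]


# Toy log-magnitude templates (dB), hypothetical values for illustration only.
templates = {
    "a": [0.0, 0.0, 0.0, 0.0],
    "i": [3.0, 3.0, 3.0, 3.0],
    "u": [0.0, 4.0, 0.0, 4.0],
}

distances = pairwise_distances(templates)
xs, cdf = empirical_cdf(distances.values())

for x, p in zip(xs, cdf):
    print(f"distance <= {x:.3f} dB: fraction {p:.2f}")
```

A mapping that preserves vowel space structure in the sense of the abstract would keep the distribution of these pairwise distances approximately unchanged after conversion; how the paper enforces this is not specified here.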

