This paper presents a novel statistical singing voice conversion (SVC) technique with direct waveform modification based on the spectrum differential that can convert voice timbre of a source singer into that of a target singer without using a vocoder to generate converted singing voice waveforms. SVC makes it possible to convert singing voice characteristics of an arbitrary source singer into those of an arbitrary target singer. However, speech quality of the converted singing voice is significantly degraded compared to that of a natural singing voice due to various factors, such as analysis and modeling errors in the vocoder-based framework. To alleviate this degradation, we propose a statistical conversion process that directly modifies the signal in the waveform domain by estimating the difference in the spectra of the source and target singers' singing voices. The differential spectral feature is directly estimated using a differential Gaussian mixture model (GMM) that is analytically derived from the traditional GMM used as a conversion model in the conventional SVC. The experimental results demonstrate that the proposed method makes it possible to significantly improve speech quality in the converted singing voice while preserving the conversion accuracy of singer identity compared to the conventional SVC.
Bibliographic reference. Kobayashi, Kazuhiro / Toda, Tomoki / Neubig, Graham / Sakti, Sakriani / Nakamura, Satoshi (2014): "Statistical singing voice conversion with direct waveform modification based on the spectrum differential", In INTERSPEECH-2014, 2514-2518.