In our previous work, we have developed a speech modification system capable of manipulating unobserved articulatory movements by sequentially performing speech-to-articulatory inversion mapping and articulatory-to-speech production mapping based on a Gaussian mixture model (GMM)-based statistical feature mapping technique. One of the biggest issues to be addressed in this system is quality degradation of the synthetic speech caused by modeling and conversion errors in a vocoder-based waveform generation framework. To address this issue, we propose several implementation methods of direct waveform modification. The proposed methods directly filter an input speech waveform with a time sequence of spectral differential parameters calculated between unmodified and modified spectral envelop parameters in order to avoid using vocoder-based excitation signal generation. The experimental results show that the proposed direct waveform modification methods yield significantly larger quality improvements in the synthetic speech while also keeping a capability of intuitively modifying phoneme sounds by manipulating the unobserved articulatory movements.
Bibliographic reference. Tobing, Patrick Lumban / Kobayashi, Kazuhiro / Toda, Tomoki / Neubig, Graham / Sakti, Sakriani / Nakamura, Satoshi (2015): "Articulatory controllable speech modification based on Gaussian mixture models with direct waveform modification using spectrum differential", In INTERSPEECH-2015, 3350-3354.