Precise control of articulatory parameters is difficult and prevents a physical model from generating natural sounding speech signals. To determine vocal-tract shape from speech, this paper presents an inversion method for simultaneously estimating the cross-sectional area and length of the vocal tract. In addition, we performed speech resynthesis from a time-series of estimated vocal-tract shapes. The vocal-tract shape is determined through an iterative procedure that gradually optimizes the parameter values to produce the target speech spectrum. The vocal-tract shape is updated using a sensitivity function that represents the change in formant frequency caused by a small perturbation of the vocal-tract shape. When combined with a perturbation relationship of speech spectrum parameters (i.e., cepstrum parameters) and formants, our method effectively optimizes the vocal-tract shape. We quantitatively examined the accuracy using area function data for 10 isolated vowels. The results showed that the average area error was 0.43 cm2 and the average length error was 0.23 cm. This indicates that the vocal-tract shape was determined with satisfactory accuracy. We also performed an estimation experiment for continuous speech and synthesized speech from the estimated vocal-tract shape.
Bibliographic reference. Kaburagi, Tokihiko (2014): "Estimation of vocal-tract shape from speech spectrum and speech resynthesis based on a generative model", In INTERSPEECH-2014, 422-426.