We propose a deep neural network-based model for articulatory-to-acoustic conversion from real-time MRI (rtMRI) data. rtMRI can record the entire set of articulatory organs at high resolution, which is an advantage for articulatory-to-acoustic conversion, but it has a relatively low sampling rate. To address this, we apply super-resolution along the temporal dimension using transposed convolution. Transposed convolution increases the resolution by inverting the resolution reduction performed by a standard CNN. To evaluate performance on data with different temporal resolutions, we conducted experiments on two datasets: USC-TIMIT and a Japanese rtMRI dataset. Evaluation using mel-cepstral distortion and PESQ showed that transposed convolution is effective for generating accurate acoustic features. We also confirmed that increasing the magnification factor of the super-resolution improves the PESQ score.
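As a rough illustration of how transposed convolution raises temporal resolution, the sketch below upsamples a short one-dimensional sequence along the time axis. It is not the model from the paper: the function name `transposed_conv1d`, the scalar-per-frame input, and the fixed kernel weights are all assumptions chosen for illustration, whereas a trained network would learn the kernel and operate on multi-channel features.

```python
from typing import List


def transposed_conv1d(frames: List[float], kernel: List[float], stride: int) -> List[float]:
    """Upsample a 1-D sequence along time with a transposed convolution (no padding).

    Hypothetical helper for illustration only; one scalar stands in for each frame.
    """
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    if not frames:
        return []
    # Output length of an unpadded transposed convolution.
    out_len = (len(frames) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for t, value in enumerate(frames):
        # Each input frame adds a scaled copy of the kernel at offset t * stride;
        # overlapping copies are summed.
        for k, weight in enumerate(kernel):
            out[t * stride + k] += value * weight
    return out


# Assumed toy input: three low-rate frames, magnification factor (stride) of 2.
frames = [1.0, 2.0, 3.0]

# Kernel size equal to the stride: each frame is simply repeated.
repeated = transposed_conv1d(frames, [1.0, 1.0], stride=2)

# A wider, overlapping kernel: neighbouring frames are blended,
# which resembles linear interpolation between them.
blended = transposed_conv1d(frames, [0.5, 1.0, 0.5], stride=2)

print(repeated)
print(blended)
```

With a stride of 2, three input frames become six or seven output frames depending on the kernel width, which is the sense in which the stride acts as the magnification factor of the temporal super-resolution.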
Cite as: Tanji, R., Ohmura, H., Katsurada, K. (2021) Using Transposed Convolution for Articulatory-to-Acoustic Conversion from Real-Time MRI Data. Proc. Interspeech 2021, 3176-3180, doi: 10.21437/Interspeech.2021-906
@inproceedings{tanji21_interspeech,
  author={Ryo Tanji and Hidefumi Ohmura and Kouichi Katsurada},
  title={{Using Transposed Convolution for Articulatory-to-Acoustic Conversion from Real-Time MRI Data}},
  year=2021,
  booktitle={Proc. Interspeech 2021},
  pages={3176--3180},
  doi={10.21437/Interspeech.2021-906}
}