ISCA Archive Interspeech 2017

Generative Adversarial Network-Based Postfilter for STFT Spectrograms

Takuhiro Kaneko, Shinji Takaki, Hirokazu Kameoka, Junichi Yamagishi

We propose a learning-based postfilter to reconstruct the high-fidelity spectral texture in short-time Fourier transform (STFT) spectrograms. In speech-processing systems, such as speech synthesis, conversion, enhancement, separation, and coding, STFT spectrograms have been widely used as key acoustic representations. In these tasks, we normally need to precisely generate or predict the representations from inputs; however, generated spectra typically lack the fine structures found in the true data. To overcome this limitation and reconstruct spectra having finer structures, we propose a generative adversarial network (GAN)-based postfilter that is implicitly optimized to match the true feature distribution in adversarial learning. The challenge with this postfilter is that a GAN cannot be easily trained for very high-dimensional data such as STFT spectra. We therefore take a simple divide-and-concatenate strategy. Namely, we first divide the spectrograms into multiple frequency bands with overlap, reconstruct the individual bands using the GAN-based postfilter trained for each band, and finally connect the bands with overlap. We tested our proposed postfilter on a deep neural network-based text-to-speech task and confirmed that it was able to reduce the gap between synthesized and target spectra, even in the high-dimensional STFT domain.
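
The divide-and-concatenate strategy described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the band width, the overlap, the triangular weighting used to blend overlapping bins, and the function names (split_bands, postfilter_band, connect_bands) are all assumptions made for illustration, since the abstract does not specify them. The per-band GAN postfilter is replaced by an identity placeholder.

```python
import numpy as np

def split_bands(spec, band_size, overlap):
    """Split a (frames, bins) spectrogram into overlapping frequency bands."""
    hop = band_size - overlap
    n_bins = spec.shape[1]
    starts = list(range(0, n_bins - band_size + 1, hop))
    if starts[-1] + band_size < n_bins:
        # Add a final band flush with the highest bin so every bin is covered.
        starts.append(n_bins - band_size)
    return starts, [spec[:, s:s + band_size] for s in starts]

def postfilter_band(band, band_index):
    """Hypothetical stand-in for the GAN postfilter trained for one band."""
    return band  # identity: a trained generator would go here

def connect_bands(bands, starts, n_bins):
    """Reconnect bands, blending overlapped bins with triangular weights.

    The weighting scheme is an assumption, not taken from the paper.
    """
    band_size = bands[0].shape[1]
    out = np.zeros((bands[0].shape[0], n_bins))
    weight = np.zeros(n_bins)
    # Strictly positive weights that peak at the band centre.
    w = np.minimum(np.arange(1, band_size + 1),
                   np.arange(band_size, 0, -1)).astype(float)
    for s, band in zip(starts, bands):
        out[:, s:s + band_size] += band * w
        weight[s:s + band_size] += w
    return out / weight

# Example with illustrative sizes: 513 frequency bins, 128-bin bands, 32-bin overlap.
spec = np.random.default_rng(0).random((10, 513))
starts, bands = split_bands(spec, band_size=128, overlap=32)
filtered = [postfilter_band(b, i) for i, b in enumerate(bands)]
restored = connect_bands(filtered, starts, n_bins=spec.shape[1])
```

Because the placeholder postfilter is the identity, restored matches spec up to floating-point rounding, which gives a quick check that the split and reconnect steps are consistent before a real per-band model is plugged in.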

doi: 10.21437/Interspeech.2017-962

Cite as: Kaneko, T., Takaki, S., Kameoka, H., Yamagishi, J. (2017) Generative Adversarial Network-Based Postfilter for STFT Spectrograms. Proc. Interspeech 2017, 3389-3393, doi: 10.21437/Interspeech.2017-962

  author={Takuhiro Kaneko and Shinji Takaki and Hirokazu Kameoka and Junichi Yamagishi},
  title={{Generative Adversarial Network-Based Postfilter for STFT Spectrograms}},
  booktitle={Proc. Interspeech 2017},