Singing Voice Synthesis Based on Deep Neural Networks

Masanari Nishimura, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda

Singing voice synthesis techniques based on hidden Markov models (HMMs) have been proposed. In these approaches, the spectrum, excitation, and duration of the singing voice are simultaneously modeled with context-dependent HMMs, and waveforms are generated from the HMMs themselves. However, the quality of the synthesized singing voices still has not reached that of natural singing voices. Deep neural networks (DNNs) have substantially improved on conventional approaches in various research areas, including speech recognition, image recognition, and speech synthesis. DNN-based text-to-speech (TTS) synthesis can produce high-quality speech. In a DNN-based TTS system, a DNN is trained to represent the mapping from contextual features to acoustic features, which are modeled by decision-tree-clustered context-dependent HMMs in an HMM-based TTS system. In this paper, we propose singing voice synthesis based on a DNN and evaluate its effectiveness. The relationship between the musical score and its acoustic features is modeled frame by frame with a DNN. To address the sparseness of pitch contexts in the database, musical-note-level pitch normalization and linear-interpolation techniques are used to prepare the excitation features. Subjective experimental results show that the DNN-based system outperforms the HMM-based system in terms of naturalness.
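The musical-note-level pitch normalization mentioned above can be illustrated with a minimal sketch: the observed log F0 of each frame is expressed relative to the log F0 of the note in the score, with unvoiced frames linearly interpolated so that every frame carries a value. The function names and the use of NaN to mark unvoiced frames are assumptions for this illustration, not details taken from the paper.

```python
import numpy as np


def note_to_log_f0(midi_note):
    # Convert a MIDI note number from the score to log F0 in Hz
    # (A4 = MIDI note 69 = 440 Hz, equal temperament).
    return np.log(440.0 * 2.0 ** ((np.asarray(midi_note, dtype=float) - 69.0) / 12.0))


def normalize_pitch(lf0, note_midi):
    """Musical-note-level pitch normalization (illustrative sketch).

    lf0       : per-frame observed log F0; unvoiced frames marked NaN.
    note_midi : per-frame MIDI note number taken from the musical score.
    Returns the per-frame difference between the (interpolated) observed
    log F0 and the score note's log F0.
    """
    lf0 = np.asarray(lf0, dtype=float)
    voiced = ~np.isnan(lf0)
    idx = np.arange(len(lf0))
    # Linearly interpolate log F0 across unvoiced gaps so the
    # excitation feature is defined for every frame.
    filled = np.interp(idx, idx[voiced], lf0[voiced])
    return filled - note_to_log_f0(note_midi)
```

For a singer holding exactly the score pitch, the normalized value is near zero; deviations capture vibrato, overshoot, and other expressive pitch behavior independently of the absolute note, which mitigates the sparseness of pitch contexts in the training data.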

DOI: 10.21437/Interspeech.2016-1027

Cite as

Nishimura, M., Hashimoto, K., Oura, K., Nankaku, Y., Tokuda, K. (2016) Singing Voice Synthesis Based on Deep Neural Networks. Proc. Interspeech 2016, 2478-2482.

@inproceedings{nishimura16_interspeech,
  author={Masanari Nishimura and Kei Hashimoto and Keiichiro Oura and Yoshihiko Nankaku and Keiichi Tokuda},
  title={Singing Voice Synthesis Based on Deep Neural Networks},
  booktitle={Interspeech 2016},
  year={2016},
  pages={2478--2482},
  doi={10.21437/Interspeech.2016-1027}
}