Deep Learning Techniques in Tandem with Signal Processing Cues for Phonetic Segmentation for Text to Speech Synthesis in Indian Languages

Arun Baby, Jeena J. Prakash, Rupak Vignesh, Hema A. Murthy


Automatic detection of phoneme boundaries is an important sub-task in building speech processing applications, especially text-to-speech synthesis (TTS) systems. The main drawback of Gaussian mixture model-hidden Markov model (GMM-HMM) based forced alignment is that the phoneme boundaries are not explicitly modeled. In earlier work, we proposed using signal processing cues in tandem with GMM-HMM based forced alignment to correct boundaries when building Indian language TTS systems. In this paper, we capitalise on robust acoustic modeling techniques such as deep neural networks (DNN) and convolutional deep neural networks (CNN). The GMM-HMM based forced alignment is replaced by DNN-HMM/CNN-HMM based forced alignment, and signal processing cues are then used to correct the resulting segment boundaries. TTS systems built using these boundaries show a relative improvement in synthesis quality.
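
The boundary-correction idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that forced alignment has already produced phoneme boundary times and that some signal processing detector has produced candidate boundary times; the abstract specifies neither. The function name `correct_boundaries`, the parameter `max_shift`, and its 20 ms default are hypothetical choices made for illustration only.

```python
from bisect import bisect_left
from typing import List, Sequence


def correct_boundaries(
    hmm_boundaries: Sequence[float],
    cue_times: Sequence[float],
    max_shift: float = 0.02,  # hypothetical tolerance in seconds, not from the paper
) -> List[float]:
    """Snap each forced-alignment boundary to the nearest cue within max_shift.

    Illustrative rule only: a boundary moves to the closest candidate time
    if that candidate lies within max_shift seconds; otherwise it is kept.
    """
    cues = sorted(cue_times)
    corrected = []
    for b in hmm_boundaries:
        # Only the cues immediately left and right of b can be the nearest one.
        i = bisect_left(cues, b)
        candidates = cues[max(i - 1, 0): i + 1]
        nearest = min(candidates, key=lambda c: abs(c - b), default=None)
        if nearest is not None and abs(nearest - b) <= max_shift:
            corrected.append(nearest)
        else:
            corrected.append(b)
    return corrected


if __name__ == "__main__":
    # Made-up times in seconds, for demonstration only.
    aligned = [0.10, 0.25, 0.40]
    cues = [0.11, 0.30, 0.405]
    print(correct_boundaries(aligned, cues))
```

With these made-up inputs, the first and third boundaries move to nearby cues and the second is left unchanged because no cue lies within the tolerance. A practical system would also need to keep the corrected boundaries strictly ordered, which this sketch does not enforce.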


DOI: 10.21437/Interspeech.2017-666

Cite as: Baby, A., Prakash, J.J., Vignesh, R., Murthy, H.A. (2017) Deep Learning Techniques in Tandem with Signal Processing Cues for Phonetic Segmentation for Text to Speech Synthesis in Indian Languages. Proc. Interspeech 2017, 3817-3821, DOI: 10.21437/Interspeech.2017-666.


@inproceedings{Baby2017,
  author={Arun Baby and Jeena J. Prakash and Rupak Vignesh and Hema A. Murthy},
  title={Deep Learning Techniques in Tandem with Signal Processing Cues for Phonetic Segmentation for Text to Speech Synthesis in Indian Languages},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3817--3821},
  doi={10.21437/Interspeech.2017-666},
  url={http://dx.doi.org/10.21437/Interspeech.2017-666}
}