Redefining the Linguistic Context Feature Set for HMM and DNN TTS Through Position and Parsing

Rasmus Dall, Kei Hashimoto, Keiichiro Oura, Yoshihiko Nankaku, Keiichi Tokuda


In this paper we present an investigation of a number of alternative linguistic context feature sets for HMM and DNN text-to-speech synthesis. The representation of positional values is explored through two alternatives to the standard set of absolute values, namely relational and categorical values. In a preference test, the categorical representation was preferred for both HMM and DNN synthesis. Subsequently, features based on probabilistic context-free grammar and dependency parsing are presented. These features represent the phrase-level relations between words in a sentence, and a preference evaluation found that all of them improved upon the base set, with a combination of both parsing methods performing best overall. As the features primarily affected the F0 prediction, this illustrates the potential of syntactic structure to improve prosody in TTS.
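
To make the three positional representations concrete, the following is a minimal sketch, not the authors' implementation. It assumes that an absolute value is a 1-based index within the enclosing unit, that a relational value is that index divided by the unit length, and that a categorical value is a coarse label. The function names and the label set ("single", "first", "middle", "last") are hypothetical and chosen only for illustration; the paper's actual definitions may differ.

```python
def absolute_position(index: int, length: int) -> int:
    """Absolute value: the 1-based index itself (assumed definition)."""
    _check(index, length)
    return index


def relational_position(index: int, length: int) -> float:
    """Relational value: index relative to unit length (assumed definition)."""
    _check(index, length)
    return index / length


def categorical_position(index: int, length: int) -> str:
    """Categorical value: a coarse label (hypothetical label set)."""
    _check(index, length)
    if length == 1:
        return "single"
    if index == 1:
        return "first"
    if index == length:
        return "last"
    return "middle"


def _check(index: int, length: int) -> None:
    # Guard against indices outside the unit.
    if length < 1 or not 1 <= index <= length:
        raise ValueError("index must satisfy 1 <= index <= length")


if __name__ == "__main__":
    # Position of each word in a hypothetical four-word phrase.
    for i in range(1, 5):
        print(i, absolute_position(i, 4), relational_position(i, 4),
              categorical_position(i, 4))
```

Under these assumptions, the third word of a four-word phrase would be encoded as 3 (absolute), 0.75 (relational), or "middle" (categorical).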


DOI: 10.21437/Interspeech.2016-399

Cite as

Dall, R., Hashimoto, K., Oura, K., Nankaku, Y., Tokuda, K. (2016) Redefining the Linguistic Context Feature Set for HMM and DNN TTS Through Position and Parsing. Proc. Interspeech 2016, 2851-2855.

BibTeX
@inproceedings{Dall+2016,
author={Rasmus Dall and Kei Hashimoto and Keiichiro Oura and Yoshihiko Nankaku and Keiichi Tokuda},
title={Redefining the Linguistic Context Feature Set for HMM and DNN TTS Through Position and Parsing},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-399},
url={http://dx.doi.org/10.21437/Interspeech.2016-399},
pages={2851--2855}
}