Conventional statistical parametric speech synthesis relies on decision trees to cluster together similar contexts, resulting in tied-parameter context-dependent hidden Markov models (HMMs). However, decision tree clustering has a major weakness: it uses hard divisions and subdivides the model space based on one feature at a time, fragmenting the data and failing to exploit interactions between linguistic context features. The linguistic features themselves are also problematic, being noisy and of varied relevance to the acoustics. We propose to combine our previous work on vector-space representations of linguistic context, which have the added advantage of working directly from textual input, with Deep Neural Networks (DNNs), which can directly accept such continuous representations as input. The outputs of the network are probability distributions over speech features. Maximum Likelihood Parameter Generation is then used to create parameter trajectories, which in turn drive a vocoder to generate the waveform. Various configurations of the system are compared, using both conventional and vector-space context representations and with the DNN making speech parameter predictions at two different temporal resolutions: frames or states. Both objective and subjective results are presented.
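To make the pipeline concrete, here is a minimal sketch (not the authors' implementation) of the acoustic-model step described above: a feed-forward DNN maps a per-frame linguistic context vector to the mean and variance of the acoustic features for that frame, and such per-frame distributions would then be passed to Maximum Likelihood Parameter Generation (MLPG). All dimensionalities, layer sizes, and weights below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

CONTEXT_DIM = 100     # assumed size of the vector-space context representation
ACOUSTIC_DIM = 40     # assumed size of the vocoder parameter vector (e.g. mel-cepstra)
HIDDEN = 256          # assumed hidden-layer width

# Randomly initialised weights stand in for trained parameters.
W1 = rng.normal(scale=0.1, size=(CONTEXT_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(HIDDEN, 2 * ACOUSTIC_DIM))
b2 = np.zeros(2 * ACOUSTIC_DIM)

def predict_frame_distribution(context_vector):
    """Map one frame's linguistic context vector to (mean, variance)
    of the acoustic features, i.e. the DNN output distribution."""
    h = np.tanh(context_vector @ W1 + b1)      # hidden layer
    out = h @ W2 + b2                          # linear output layer
    mean, log_var = out[:ACOUSTIC_DIM], out[ACOUSTIC_DIM:]
    return mean, np.exp(log_var)               # exponentiate so variances are positive

# Per-frame (or per-state) distributions like these would feed MLPG,
# which smooths them into parameter trajectories for the vocoder.
frame_context = rng.normal(size=CONTEXT_DIM)
mu, var = predict_frame_distribution(frame_context)
print(mu.shape, var.shape)   # (40,) (40,)
```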
Index Terms: TTS, speech synthesis, deep neural network, vector space model, unsupervised learning
Cite as: Lu, H., King, S., Watts, O. (2013) Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis. Proc. 8th ISCA Workshop on Speech Synthesis (SSW 8), 261-265
@inproceedings{lu13_ssw,
  author={Heng Lu and Simon King and Oliver Watts},
  title={{Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis}},
  year=2013,
  booktitle={Proc. 8th ISCA Workshop on Speech Synthesis (SSW 8)},
  pages={261--265}
}