Eighth ISCA Workshop on Speech Synthesis
Barcelona, Catalonia, Spain
Conventional statistical parametric speech synthesis relies on decision trees to
cluster together similar contexts, resulting in tied-parameter context-dependent
hidden Markov models (HMMs). However, decision tree clustering has a major
weakness: it uses hard divisions and subdivides the model space based on one feature
at a time, fragmenting the data and failing to exploit interactions between
linguistic context features. These linguistic features themselves are also problematic,
being noisy and of varied relevance to the acoustics.
We propose to combine our previous work on vector-space representations of linguistic context, which have the added advantage of working directly from textual input, and Deep Neural Networks (DNNs), which can directly accept such continuous representations as input. The outputs of the network are probability distributions over speech features. Maximum Likelihood Parameter Generation is then used to create parameter trajectories, which in turn drive a vocoder to generate the waveform.
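The pipeline described above (a DNN mapping continuous context vectors to per-frame Gaussian parameters, followed by Maximum Likelihood Parameter Generation) can be sketched in miniature as follows. This is an illustrative toy, not the paper's implementation: the network weights are random and untrained, dimensions are arbitrary, and a single static feature with a simple delta window stands in for a full vocoder parameterization.

```python
import numpy as np

# Hypothetical sketch: a DNN maps a linguistic context vector to per-frame
# Gaussian parameters (mean, log-variance) over static + delta speech features;
# MLPG then solves for the smooth static trajectory. All sizes and weights
# below are illustrative assumptions, not values from the paper.

rng = np.random.default_rng(0)
T, ctx_dim, hid = 20, 50, 64   # frames, context-vector size, hidden units

# Toy feedforward DNN with random (untrained) weights.
W1 = rng.normal(0, 0.1, (ctx_dim, hid))
W2 = rng.normal(0, 0.1, (hid, 4))  # [mu_static, mu_delta, logvar_static, logvar_delta]

def dnn(ctx):
    h = np.tanh(ctx @ W1)
    return h @ W2

ctx = rng.normal(size=(T, ctx_dim))        # one context vector per frame
out = dnn(ctx)
mu = out[:, :2].reshape(-1)                # interleaved [static, delta] per frame
prec = np.exp(-out[:, 2:]).reshape(-1)     # precisions = 1 / variance

# Window matrix W maps the static trajectory c (length T) to the
# [static, delta] observation sequence; delta uses window 0.5*(c[t+1]-c[t-1]).
Wm = np.zeros((2 * T, T))
for t in range(T):
    Wm[2 * t, t] = 1.0                     # static feature
    Wm[2 * t + 1, max(t - 1, 0)] -= 0.5    # delta feature
    Wm[2 * t + 1, min(t + 1, T - 1)] += 0.5

# MLPG closed form: c = (W^T D W)^{-1} W^T D mu, with D = diag(precisions).
D = np.diag(prec)
c = np.linalg.solve(Wm.T @ D @ Wm, Wm.T @ D @ mu)
```

Solving the weighted least-squares system rather than taking the static means directly is what lets the delta statistics enforce smoothness across frames, which is the role MLPG plays before the trajectories are passed to the vocoder.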
Various configurations of the system are compared, using both conventional and vector-space context representations, and with the DNN making speech-parameter predictions at two different temporal resolutions: frames or states. Both objective and subjective results are presented.
Index Terms: TTS, speech synthesis, deep neural network, vector space model, unsupervised learning
Bibliographic reference. Lu, Heng / King, Simon / Watts, Oliver (2013): "Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis", In SSW8, 261-265.