Eighth ISCA Workshop on Speech Synthesis, Barcelona, Catalonia, Spain
Conventional statistical parametric speech synthesis relies on decision trees to
cluster together similar contexts, resulting in tied-parameter context-dependent
hidden Markov models (HMMs). However, decision tree clustering has a major
weakness: it uses hard divisions and splits the model space on one feature
at a time, fragmenting the data and failing to exploit interactions between
linguistic context features. The linguistic features themselves are also problematic,
being noisy and of varying relevance to the acoustics.
We propose to combine
our previous work on vector-space representations of linguistic context, which
have the added advantage of working directly from textual input, with Deep Neural
Networks (DNNs), which can accept such continuous representations directly
as input. The outputs of the network are probability distributions over speech
features. Maximum Likelihood Parameter Generation is then used to create parameter
trajectories, which in turn drive a vocoder to generate the waveform.
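The core of this pipeline is a feed-forward network mapping a continuous linguistic-context vector to a distribution over acoustic features for each frame or state. Below is a minimal sketch of that stage only, not the authors' implementation: the layer sizes, the dimensionalities `CONTEXT_DIM` and `ACOUSTIC_DIM`, the class name `ContextToAcousticDNN`, and the diagonal-Gaussian output head are all illustrative assumptions, and the MLPG and vocoder steps are omitted.

```python
# Illustrative sketch of a DNN predicting a Gaussian over vocoder parameters
# from a continuous linguistic-context vector. Dimensions and architecture are
# assumptions, not the configuration used in the paper.

import torch
import torch.nn as nn

CONTEXT_DIM = 100   # assumed size of the vector-space context representation
ACOUSTIC_DIM = 40   # assumed size of the vocoder parameter vector per frame

class ContextToAcousticDNN(nn.Module):
    def __init__(self, context_dim=CONTEXT_DIM, acoustic_dim=ACOUSTIC_DIM, hidden=512):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(context_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # Two heads: mean and log-variance of the predicted distribution
        self.mean_head = nn.Linear(hidden, acoustic_dim)
        self.logvar_head = nn.Linear(hidden, acoustic_dim)

    def forward(self, context):
        h = self.body(context)
        return self.mean_head(h), self.logvar_head(h)

def gaussian_nll(mean, logvar, target):
    # Negative log-likelihood of the target frame under a diagonal Gaussian
    return 0.5 * (logvar + (target - mean) ** 2 / logvar.exp()).sum(dim=-1).mean()

# Toy usage with random tensors standing in for real context vectors and frames
model = ContextToAcousticDNN()
context = torch.randn(8, CONTEXT_DIM)   # batch of frame- or state-level context vectors
target = torch.randn(8, ACOUSTIC_DIM)   # corresponding vocoder parameter vectors
mean, logvar = model(context)
loss = gaussian_nll(mean, logvar, target)
loss.backward()
```

At synthesis time, the predicted means and variances for a whole utterance would then be passed, together with delta-feature windows, to MLPG to produce smooth parameter trajectories that drive the vocoder.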
Various configurations of the system are compared, using both conventional and
vector-space context representations, and with the DNN making speech parameter
predictions at two different temporal resolutions: frames or states. Both objective
and subjective results are presented.
Index Terms: TTS, speech synthesis, deep neural network,
vector space model, unsupervised learning
Bibliographic reference. Lu, Heng / King, Simon / Watts, Oliver (2013): "Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis", In SSW8, 261-265.