Fourth ISCA ITRW on Speech Synthesis, August 29 - September 1, 2001
Defining cost terms for unit-selection-based synthesis is a difficult task. Cost terms are usually based on the developers' common phonetic knowledge and on subsequent perceptual experiments. Using a training set for supervised learning, an approach well known from pattern recognition, could be a useful way to arrive at a more formal analysis of the different factors influencing the selection of units.
As a first step toward this aim, we present an objective distance measure that is used to sort the units in the corpus relative to a given natural unit, and we show its relevance to human perception. To keep listeners from paying too much attention to discontinuities caused by concatenation, we also present a waveform-based smoothing algorithm.
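The two ideas in this paragraph, ranking corpus units by their distance to a natural reference unit and smoothing concatenation boundaries in the waveform domain, can be illustrated with a minimal sketch. The abstract does not specify the actual distance measure or smoothing algorithm, so everything here is an assumption: the mean Euclidean distance over linearly aligned feature frames and the linear cross-fade are stand-ins, and the names `unit_distance`, `rank_units`, and `crossfade` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Frame = Sequence[float]  # one feature vector per analysis frame (assumed representation)


def resample(frames: Sequence[Frame], n: int) -> List[Frame]:
    """Pick n frames at evenly spaced positions (a crude linear time alignment)."""
    if n == 1:
        return [frames[0]]
    last = len(frames) - 1
    return [frames[round(i * last / (n - 1))] for i in range(n)]


def unit_distance(reference: Sequence[Frame], candidate: Sequence[Frame]) -> float:
    """Mean Euclidean frame distance after stretching the candidate to the
    reference length. Assumed stand-in, not the measure used in the paper."""
    aligned = resample(candidate, len(reference))
    total = sum(math.dist(r, c) for r, c in zip(reference, aligned))
    return total / len(reference)


def rank_units(
    reference: Sequence[Frame], corpus: Dict[str, Sequence[Frame]]
) -> List[Tuple[str, float]]:
    """Sort corpus units from most to least similar to the natural reference unit."""
    scored = [(name, unit_distance(reference, frames)) for name, frames in corpus.items()]
    return sorted(scored, key=lambda item: item[1])


def crossfade(left: Sequence[float], right: Sequence[float], overlap: int) -> List[float]:
    """Join two waveforms with a linear cross-fade over `overlap` samples.
    Assumed stand-in for the waveform-based smoothing mentioned above."""
    if overlap <= 0 or overlap > min(len(left), len(right)):
        raise ValueError("overlap must be between 1 and the shorter waveform length")
    start = len(left) - overlap
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of the right-hand unit rises across the overlap
        mixed.append((1 - w) * left[start + i] + w * right[i])
    return list(left[:start]) + mixed + list(right[overlap:])


# Toy data: a two-frame reference unit and three hypothetical corpus units.
reference = [(0.0, 0.0), (1.0, 1.0)]
corpus = {
    "a": [(0.0, 0.0), (1.0, 1.0)],
    "b": [(3.0, 4.0), (4.0, 5.0)],
    "c": [(0.0, 1.0)],
}
ranking = rank_units(reference, corpus)
joined = crossfade([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], overlap=2)
```

In a setup like this, the head of `ranking` would supply the candidates presented to listeners, and `crossfade` would be applied at every unit boundary before playback so that boundary clicks do not dominate the similarity judgement.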
Experiments show that the sorting criterion and human perception match in most cases. Furthermore, similarity between natural and synthetic speech is found to be better when phoneme-based units are used, whereas naturalness increases when larger units are concatenated.
Bibliographic reference. Stöber, Karlheinz / Wagner, Petra / Klabbers, Esther / Hess, Wolfgang (2001): "Definition of a training set for unit selection-based speech synthesis", In SSW4-2001, paper 118.
There are two pairs of sentences.
All examples were generated by the presented procedure, without prosodic manipulation.
Sentence 1
Original unmanipulated speech
Raw concatenation of phonemes
Raw concatenation of phonemes and smoothed concatenation boundaries
Raw concatenation of word or subword units
Sentence 2
Original unmanipulated speech
Raw concatenation of phonemes
Raw concatenation of phonemes and smoothed concatenation boundaries
Raw concatenation of word or subword units