Second ESCA/IEEE Workshop on Speech Synthesis, September 12-15, 1994
One of the enduring problems in achieving natural sounding synthetic
speech is that of getting the rhythm right. Usually this problem is
construed as the search for appropriate algorithms for altering durations
of segments under various contextual conditions (e.g. initial versus final
in word or phrase, in stressed versus unstressed syllables). Recently,
Campbell and Isard (1991) have suggested that a more effective model is
one in which the syllable is taken as the distinguished timing unit
and segmental durations accommodated secondarily to syllable durations.
We propose here that there is no distinguished timing unit. While other
synthesis systems use phonemes, diphones or other linearly arranged
phone-sized units and employ 'hidden structure', YorkTalk uses explicit
tree-like phonological representations.
We will compare the temporal characteristics of the output of the
YorkTalk system with Klattalk (Klatt, ms) on one hand and the naturalistic
observations of Fowler (1981) on the other. We will show that it is possible
to produce similar, natural sounding temporal relations by employing
linguistic structures which are given a compositional parametric and
temporal interpretation (Local, 1992; Ogden, 1992).
YorkTalk's metrical and phonotactic parsers parse input into structures
consisting of feet, syllables and syllable constituents. In these
structures, the rime is the head of the syllable, the nucleus is
the head of the rime and the strong syllable is the head of the foot
(cf Coleman 1992). Every node in the graph is given a head-first
temporal and parametric phonetic interpretation. A co-production
model of coarticulation (cf Fowler 1980) is implemented in YorkTalk
by overlaying parameters. Since the nucleus is the head of the
syllable, the nucleus and syllable are coextensive. Because the
onset and coda are fitted within the temporal space of the nucleus, they
inherit the properties of the whole syllable. Where structures permit,
constituents are shared between syllables (ambisyllabicity).
The temporal interpretation of ambisyllabicity
is the temporal and parametric overlaying of one syllable on
another (Local, 1992; Ogden, 1992).
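The head-first temporal interpretation described above can be illustrated with a minimal sketch. It is not YorkTalk's implementation: the names `Interval`, `interpret_syllable` and `interpret_ambisyllabic_pair`, the alignment of onset and coda with the syllable edges, and all durations (in milliseconds) are hypothetical, chosen only to show a nucleus that is coextensive with its syllable, an onset and coda fitted inside it, and two syllables overlaid on a shared constituent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A stretch of time in milliseconds (hypothetical unit choice)."""
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def interpret_syllable(start, nucleus_ms, onset_ms, coda_ms):
    """Head-first timing: the nucleus is timed first and spans the whole
    syllable; onset and coda are then overlaid at its edges (assumption)."""
    if onset_ms + coda_ms > nucleus_ms:
        raise ValueError("onset and coda must fit inside the nucleus")
    nucleus = Interval(start, start + nucleus_ms)
    return {
        "syllable": nucleus,  # coextensive with its head
        "nucleus": nucleus,
        "onset": Interval(nucleus.start, nucleus.start + onset_ms),
        "coda": Interval(nucleus.end - coda_ms, nucleus.end),
    }


def interpret_ambisyllabic_pair(first_nucleus_ms, first_onset_ms, shared_ms,
                                second_nucleus_ms, second_coda_ms):
    """Overlay the second syllable on the first so that the shared
    constituent is both the coda of one and the onset of the other."""
    first = interpret_syllable(0.0, first_nucleus_ms, first_onset_ms, shared_ms)
    second_start = first["syllable"].end - shared_ms
    second = interpret_syllable(second_start, second_nucleus_ms,
                                shared_ms, second_coda_ms)
    return first, second


# Illustrative durations only; they are not taken from the paper.
s1, s2 = interpret_ambisyllabic_pair(200.0, 60.0, 50.0, 180.0, 40.0)
foot_duration = s2["syllable"].end - s1["syllable"].start
```

In this sketch no segment carries a duration of its own outside the syllable that contains it, and the two-syllable stretch is shorter than the sum of its syllables because the shared constituent is counted once.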
Bibliographic reference. Local, John / Ogden, Richard (1994): "A model of timing for non-segmental phonological structure", In SSW2-1994, 236-239.