September 22-25, 1997
Controlling timing in text-to-speech synthesis systems is complicated because many contextual factors affect timing; moreover, which factors matter and what their precise effects are vary among languages. We describe here a language-independent approach for duration control. At run time, a language-independent timing module accesses language-specific tables. These tables specify which sub-classes of the feature space (i.e., all combinations of context and phone identity) are homogeneous in the specific sense that the same factors have similar effects on the cases in a sub-class. Within a sub-class, durations are modeled by simple arithmetic models such as multiplicative, additive, or, more generally, sums-of-products models. Exploratory statistical methods (supervised) and parameter estimation techniques (unsupervised) are used for table construction.
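As an illustration of the model family named in the abstract, the following is a minimal sketch of a sums-of-products evaluation, in which a duration is a sum of terms and each term is a product of per-factor parameters. The function name `sums_of_products`, the factor names, and all parameter values are hypothetical and chosen only for illustration; they are not taken from the paper. A multiplicative model is the special case with a single product term, and an additive model is the special case in which every term contains exactly one factor.

```python
from math import prod


def sums_of_products(features, terms):
    """Evaluate a sums-of-products model for one segment.

    features: dict mapping factor name -> observed level for this segment.
    terms: list of product terms; each term is a dict mapping
           factor name -> {level: parameter value}.
    Returns the sum over terms of the product of the selected parameters.
    """
    return sum(
        prod(params[features[factor]] for factor, params in term.items())
        for term in terms
    )


# Hypothetical parameter tables (illustrative values, not from the paper).
# Multiplicative model: one term containing every factor.
multiplicative = [{
    "phone": {"a": 100.0, "i": 80.0},                 # base value in ms
    "stress": {"stressed": 1.5, "unstressed": 1.0},   # unitless scaling
}]

# Additive model: one single-factor term per factor.
additive = [
    {"phone": {"a": 100.0, "i": 80.0}},               # ms
    {"stress": {"stressed": 40.0, "unstressed": 0.0}},  # ms offset
]

context = {"phone": "a", "stress": "stressed"}
print(sums_of_products(context, multiplicative))  # 150.0
print(sums_of_products(context, additive))        # 140.0
```

In a table-driven design of the kind the abstract describes, each sub-class of the feature space could carry its own list of terms, so the same evaluation routine would serve every language while only the tables differ. That arrangement is an assumption of this sketch.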
Bibliographic reference. Santen, Jan van / Shih, Chilin / Möbius, Bernd / Tzoukermann, Evelyne / Tanenblatt, Michael (1997): "Multi-lingual duration modeling", In EUROSPEECH-1997, 2651-2654.