Second ESCA/IEEE Workshop on Speech Synthesis
September 12-15, 1994
Two decades ago we had a text-to-speech system that used an articulatory model [Coker, Umeda and Browman, "Automatic synthesis from ordinary English text," IEEE Trans., AU-21, 293-298 (1973); Coker, "A model of articulatory dynamics and control," Proc. IEEE, 64, 452-460 (1976)]. Recently we have revived that system and made many substantial changes. The system is integrated with the current AT&T TTS [Olive, Roe and Tschirgi, "Speech processing systems that listen, too," AT&T Technology, V6, N4, 1991], for analysis of numbers, abbreviations and acronyms, and for grammatical and phrase analysis.
An interface to a vocal-tract acoustic model [Sondhi & Schroeter, "A hybrid time-frequency domain articulatory speech synthesizer," IEEE Trans., ASSP-35, 955-967 (1987)] exists, but is largely untested. Currently, the best results are produced by computing formants from the area function and driving a formant synthesizer. Use of the formant intermediary is a matter of history and convenience, rather than of any valid reason. I presently produce /r/ and /l/ by acoustic means, assign formant bandwidths, detail the spectra of fricatives, and deal with a few other details in the frequency domain. The new incarnation has many differences from the articulatory synthesizer of the 70's. lateral acoustic modes in phoneme /l/. 2) a two-branch model of nasals, rather than pole-zero pairs; 3) a more detailed representation of fricatives; 4) a more realistic model of fricative amplitude as a function of articulatory constriction and glottal adjustment; 5) a model of changing glottal voiced spectra as a function of glottal 6) a model of aspiration during voicing.
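The step of computing formants from an area function can be illustrated with a textbook lossless-tube approximation. The sketch below is not the method used in this system, which the abstract does not describe. It assumes equal-length tube sections ordered from glottis to lips, a closed glottis, an ideal open termination at the lips, and a sound speed of 35000 cm/s. The function name `formants_from_area` and all of its parameters are hypothetical.

```python
import numpy as np


def formants_from_area(areas, length_cm, c=35000.0):
    """Resonances (Hz) of a lossless tube made of equal-length sections.

    areas     : cross-sectional areas, ordered glottis to lips (assumed)
    length_cm : total tract length in cm (assumed)
    c         : speed of sound in cm/s (assumed value)
    """
    areas = np.asarray(areas, dtype=float)
    n = len(areas)

    # Reflection coefficient at each junction between adjacent sections.
    r = (areas[1:] - areas[:-1]) / (areas[1:] + areas[:-1])
    # Ideal open end at the lips: full reflection (no radiation loss).
    r = np.append(r, 1.0)

    # Build the denominator polynomial with the step-up recursion
    # D_k(z) = D_{k-1}(z) + r_k * z^{-k} * D_{k-1}(1/z).
    d = np.array([1.0])
    for k in range(n):
        padded = np.append(d, 0.0)
        d = padded + r[k] * padded[::-1]

    # One sample corresponds to a round trip through one section.
    fs = c / (2.0 * length_cm / n)

    # Poles lie on the unit circle; keep one of each conjugate pair.
    angles = np.angle(np.roots(d))
    eps = 1e-9
    keep = angles[(angles > eps) & (angles < np.pi - eps)]
    return np.sort(keep) * fs / (2.0 * np.pi)


# A uniform 17.5 cm tube gives the familiar odd quarter-wave resonances.
uniform = formants_from_area([5.0] * 10, 17.5)
print(np.round(uniform))
```

Because the model is lossless, it yields formant frequencies only. Bandwidths would have to be assigned separately, which is consistent with the frequency-domain assignment of bandwidths mentioned above.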
At the physiological and acoustic level, there is a new model of glottal behavior, and boundary and stress. Also in the new incarnation are substantial changes in the strategy for articulatory motion between phonemes. The new strategy is based on articulatory mimic studies [Parthasarathy & Coker, "On automatic estimation of articulatory parameters in a text-to-speech system," Computer Speech & Language, 6, 37-75 (1992)], in which optimizations were typically done over syllable-length segments, and the feedback manipulations were done on phoneme-sized units: target values, times and speeds of transition.
The new version produces synthesis of clearly higher quality than the earlier work. Spectrograms and even waveforms of the synthesis are not distinguishable from natural speech on casual inspection. Improved spectral details, both in steady states and in transients, make the synthesis understandable at rates on the order of 180-190 words per minute.
Bibliographic reference. Coker, Cecil H. (1994): "Articulatory text to speech", In SSW2-1994, 109.