Third ESCA/COCOSDA Workshop on Speech Synthesis

November 26-29, 1998
Jenolan Caves House, Blue Mountains, NSW, Australia

Synthetic Speech/Sound Control Language: MSCL

Osamu Mizuno, Shin'ya Nakajima

NTT Human Interface Laboratories, Yokosuka-shi, Kanagawa, Japan

The Multi-layered Speech/Sound Synthesis Control Language (MSCL) proposed herein facilitates the synthesis of various speech modes, such as nuance, mental state, and emotion, and allows speech to be synchronized easily with other media. MSCL is a multi-layered linguistic system with three layers: the semantic level layer (the S-layer), the interpretation level layer (the I-layer), and the parameter level layer (the P-layer). This multi-level description system is convenient for both laymen and professional users. MSCL also provides many effective prosodic feature control functions, such as a time-varying pattern description function, absolute and relative control forms, and the Speaker Dependent Scale (SDS). MSCL enables more natural and expressive synthetic speech than conventional TTS systems. Furthermore, mental state tendencies were investigated with a perception test that examined subjects' sensitivity to the control of synthetic speech prosody. The results showed relationships between prosodic control rules and non-verbal expressions. Duration control reflects the information processing state in spoken dialogues. Sentence-final pitch contour control reflects the reliability of the information. Control of the pitch contour's dynamic range indicates the speaker's excitement. Control of the pitch contour from its start to its peak indicates the speaker's request for attention. These relationships are useful for constructing semantic prosody control.

This paper describes these functions and the effective prosodic feature controls possible with MSCL.
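
As a rough illustration only, the four relationships reported above can be written as a lookup from a non-verbal expression to the prosodic feature that conveys it, together with one relative control applied to a pitch contour. This is a minimal Python sketch and not MSCL syntax: the names SEMANTIC_TO_PROSODY and scale_dynamic_range, the key strings, and the choice of scaling about the contour mean are all assumptions made for this example.

```python
# Hypothetical mapping from non-verbal expression to the prosodic feature
# that the abstract reports as conveying it. The key and value strings are
# illustrative labels, not MSCL commands.
SEMANTIC_TO_PROSODY = {
    "information_processing_state": "duration",
    "information_reliability": "sentence_final_pitch_contour",
    "excitement": "pitch_dynamic_range",
    "attention_request": "start_to_peak_pitch_contour",
}


def scale_dynamic_range(f0_hz, factor):
    """Relative control (assumed form): expand or compress an F0 contour
    around its mean by `factor`. A factor above 1 widens the range, a
    factor below 1 narrows it, and 1 leaves the contour unchanged."""
    if not f0_hz:
        return []
    mean = sum(f0_hz) / len(f0_hz)
    return [mean + factor * (f - mean) for f in f0_hz]


if __name__ == "__main__":
    contour = [110.0, 150.0, 190.0, 130.0]  # made-up F0 values in Hz
    feature = SEMANTIC_TO_PROSODY["excitement"]
    print(feature, scale_dynamic_range(contour, 1.5))
```

An absolute control form would instead set target values directly, for example a fixed minimum and maximum in Hz, rather than scaling the existing contour.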

