People working on spoken language technology are confronted, among other things, with the multiformity of speech as a surface phenomenon. Automatic speech recognition is extremely difficult because of strong inter- and intra-speaker variability: people speak differently in different situations. Developers of speech synthesis systems try to map that multiformity of natural speech onto automatically generated speech. For both fields, speech recognition and speech synthesis, it is of central importance to understand the underlying principles of speech production and perception. Different scientific disciplines have contributed information on this issue. In this paper a psycho-acoustic approach is described. Similarity profiles representing spaces of perceptual distinction are presented: Profile A is based on judgements gained in an introspective way, Profile B visualizes judgements on natural speech, and Profile C visualizes judgements on synthetic speech. The study shows that the profile for synthetic speech differs quite severely from that for natural speech. The paper concentrates on describing the perceptual dimensional representations of natural and synthetic speech. Data are compared and interpreted with regard to their role in synthesis assessment. A detailed analysis of test results gives some indications of why speech synthesizers often suffer from poor intelligibility and acceptability.
Bibliographic reference. Jekosch, Ute (1993): "Cluster-similarity: a useful database for speech processing", In EUROSPEECH'93, 195-198.