A new method for speech synthesis by concatenating waveforms selected from a dictionary is described. The dictionary was constructed from two hours of speech recorded by an adult male speaker and annotated with acoustic-phonetic labels. It contains 35,000 waveforms, each identified by its duration, average pitch, pitch contour and average energy. Thirty-five phonetic labels are used. In the synthesis phase, given a phoneme string and prosody information, the optimum waveforms are selected by matching their attributes against the given phonetic and prosodic information. The matching score is defined as a function of phonetic coincidence and the differences in prosodic attributes. The selected waveforms are then concatenated to produce speech. The synthesized speech has high intelligibility and naturalness.
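The selection step can be illustrated with a minimal Python sketch. The abstract does not give the actual form of the matching score, so everything specific here is an assumption made for illustration: the score is expressed as a cost (lower is better), the cost is a weighted sum of absolute attribute differences, a phonetic mismatch adds a large fixed penalty, the pitch contour is represented as a short tuple of pitch samples, and each target is matched independently. The names `Unit`, `matching_cost`, `select_units`, `concatenate`, `WEIGHTS` and `MISMATCH_PENALTY` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A dictionary entry or a synthesis target (hypothetical structure)."""
    label: str              # phonetic label
    duration_ms: float      # duration
    mean_pitch_hz: float    # average pitch
    pitch_contour: tuple    # assumed: pitch samples (Hz) at fixed relative positions
    mean_energy_db: float   # average energy
    waveform: tuple = ()    # audio samples; empty for targets


# Assumed weights and penalty; the paper's actual values are not in the abstract.
WEIGHTS = {"duration": 1.0, "pitch": 1.0, "contour": 0.5, "energy": 0.5}
MISMATCH_PENALTY = 1e6


def contour_distance(a, b):
    """Mean absolute difference between two contours over their common length."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / n


def matching_cost(target, candidate):
    """Phonetic mismatch penalty plus weighted prosodic attribute differences."""
    cost = 0.0 if target.label == candidate.label else MISMATCH_PENALTY
    cost += WEIGHTS["duration"] * abs(target.duration_ms - candidate.duration_ms)
    cost += WEIGHTS["pitch"] * abs(target.mean_pitch_hz - candidate.mean_pitch_hz)
    cost += WEIGHTS["contour"] * contour_distance(target.pitch_contour,
                                                  candidate.pitch_contour)
    cost += WEIGHTS["energy"] * abs(target.mean_energy_db - candidate.mean_energy_db)
    return cost


def select_units(targets, dictionary):
    """For each target, pick the dictionary unit with the lowest matching cost."""
    return [min(dictionary, key=lambda u: matching_cost(t, u)) for t in targets]


def concatenate(units):
    """Join the selected waveforms end to end (no smoothing at the joins)."""
    samples = []
    for unit in units:
        samples.extend(unit.waveform)
    return samples


# Toy data, invented for this example only.
dictionary = [
    Unit("a", 100.0, 120.0, (118.0, 122.0), 60.0, (0.1, 0.2)),
    Unit("a", 80.0, 150.0, (150.0, 150.0), 62.0, (0.3,)),
    Unit("k", 50.0, 0.0, (0.0, 0.0), 40.0, (0.5, 0.6)),
]
targets = [
    Unit("a", 85.0, 145.0, (145.0, 148.0), 61.0),
    Unit("k", 55.0, 0.0, (0.0, 0.0), 42.0),
]
selected = select_units(targets, dictionary)
speech = concatenate(selected)
```

In this toy run the first target is closer in duration and pitch to the second "a" entry, so that entry is chosen over the first. A linear scan over the dictionary is used for clarity; with 35,000 entries, grouping the units by phonetic label before scoring would be a natural refinement.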
Bibliographic reference. Hirokawa, Tomohisa (1989): "Speech synthesis using a waveform dictionary", In EUROSPEECH-1989, 1140-1143.