ITRW on Speech and Emotion, September 5-7, 2000
Whilst TD-PSOLA remains an adequate solution for neutral speaking styles, it is less suitable for synthesising emotions, which require more extreme pitch manipulation. With TD-PSOLA, extreme pitch manipulation can introduce distortions into the synthetic speech. These distortions could be reduced by recording concatenative units at a pitch similar to the target intonation. A recording method called Reference Pitch Prompting has thus been devised, in which a speaker records concatenative units at a set pitch, guided by a ‘Reference Pitch Prompt’ (RPP), which is a monotonic, hummed note. Speech is synthesised by concatenating RPP-recorded syllables, potentially of different f0 values, and manipulating prosody using TD-PSOLA. This synthesis method, called Multiple Pitch RP-PSOLA, involves selecting concatenative units to approximate the target f0 contour. In Multiple Pitch RP-PSOLA, the waveform inventory contains several versions of each syllable, each at a different pitch.
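The unit-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the inventory layout, the function name `select_units`, the use of one target f0 value per syllable, and the log-ratio distance are all assumptions made for illustration. For each syllable it picks the recorded version whose pitch is nearest the target, and reports the residual pitch-scaling factor that TD-PSOLA would still have to apply.

```python
import math

def select_units(syllables, target_f0, inventory):
    """Pick, per syllable, the recorded pitch nearest the target f0 (hypothetical helper).

    syllables -- list of syllable labels
    target_f0 -- one target f0 value in Hz per syllable (a simplification)
    inventory -- dict mapping syllable label to the list of recorded pitches in Hz
    """
    plan = []
    for syl, f0 in zip(syllables, target_f0):
        recorded = inventory[syl]
        # Distance on a log-frequency scale (an assumption, not from the paper).
        best = min(recorded, key=lambda p: abs(math.log(f0 / p)))
        # Residual pitch-scaling factor left for TD-PSOLA; near 1.0 means little change.
        plan.append((syl, best, f0 / best))
    return plan

# Illustrative inventory: each syllable recorded at three reference pitches.
inventory = {"hel": [100.0, 150.0, 200.0], "lo": [100.0, 150.0, 200.0]}
plan = select_units(["hel", "lo"], [190.0, 110.0], inventory)
for syl, pitch, factor in plan:
    print(f"{syl}: use {pitch:.0f} Hz recording, scale pitch by {factor:.2f}")
```

With these made-up values, the residual scaling factors are 0.95 and 1.10. A single-pitch inventory recorded only at 150 Hz would need factors of roughly 1.27 and 0.73 for the same targets, which is the kind of larger modification the multiple-pitch inventory is intended to avoid.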
Multiple Pitch RP-PSOLA is an extended version of Single Pitch RP-PSOLA, which uses only monotonic speech units. The Single Pitch RP-PSOLA and Multiple Pitch RP-PSOLA synthesis methods were compared in terms of perceived distortion, via a listening experiment. The stimuli were synthetic sentences based on three different emotions. Intonation contours were based on a corpus of emotionally spoken sentences. The Multiple Pitch RP-PSOLA stimuli were perceived to be slightly less distorted than Single Pitch RP-PSOLA stimuli.
Bibliographic reference. Vine, D. S. G. / Sahandi, R. (2000): "Synthesising emotional speech by concatenating multiple pitch recorded speech units", In SpeechEmotion-2000, 157-160.