A web-based listening test measured the intelligibility of 8 text-to-speech (TTS) systems and a linearly time-compressed human reference voice across speech rates. Four synthesis methods were compared: formant, diphone concatenation, unit selection concatenation, and HMM synthesis. For each TTS method, a female and a male American English voice from each of 2 independent synthesis engines were tested. Semantically unpredictable sentences were presented at 6 speech rates from 200 to 450 words per minute. Listeners typed what they heard in an open response format. Their transcriptions were automatically scored at the word level, and a normalized edit distance was calculated at each speech rate for each of 355 listeners. There were significant differences among the TTS systems. The two unit selection TTS systems were the most intelligible across speech rates; one was equivalent to human speech. Listeners' native language, TTS familiarity, and audio equipment were also significant factors.
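The word-level scoring described above can be illustrated with a minimal sketch. It is not the authors' scoring tool: the tokenization (lowercasing, punctuation stripping), the normalization by the longer of the two word sequences, and all function names are assumptions made for illustration, since the abstract does not specify them.

```python
import string
from collections import defaultdict


def tokenize(text):
    # Assumed normalization: lowercase, drop ASCII punctuation, split on whitespace.
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_edit_distance(ref, hyp):
    # Levenshtein distance over word tokens (insertions, deletions, substitutions).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def normalized_edit_distance(reference, transcription):
    # Assumed normalization: divide by the longer sequence so the score is in [0, 1].
    ref, hyp = tokenize(reference), tokenize(transcription)
    longest = max(len(ref), len(hyp))
    if longest == 0:
        return 0.0
    return word_edit_distance(ref, hyp) / longest


def mean_distance_by_rate(responses):
    # responses: iterable of (rate_wpm, reference, transcription) for one listener.
    by_rate = defaultdict(list)
    for rate, reference, transcription in responses:
        by_rate[rate].append(normalized_edit_distance(reference, transcription))
    return {rate: sum(scores) / len(scores) for rate, scores in by_rate.items()}
```

A score of 0 means the transcription matches the presented sentence word for word, and 1 means no words were recovered. Averaging per rate, as in `mean_distance_by_rate`, gives one value per listener and speech rate, which is the granularity the abstract reports.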
Index Terms: speech synthesis, text-to-speech, intelligibility, speech rate
Bibliographic reference. Syrdal, Ann K. / Bunnell, H. Timothy / Hertz, Susan R. / Mishra, Taniya / Spiegel, Murray / Bickley, Corine / Rekart, Deborah / Makashay, Matthew J. (2012): "Text-to-speech intelligibility across speech rates", In INTERSPEECH-2012, 623-626.