16th Annual Conference of the International Speech Communication Association

Dresden, Germany
September 6-10, 2015

Using Automatic Stress Extraction from Audio for Improved Prosody Modelling in Speech Synthesis

György Szaszák (1), András Beke (2), Gábor Olaszy (1), Bálint Pál Tóth (1)

(1) BME, Hungary
(2) Hungarian Academy of Sciences, Hungary

Generating appropriate, natural-sounding prosody is a key goal of current speech synthesis research. An important factor in this effort is the availability of a precisely labelled speech corpus with adequate prosodic stress marking. Obtaining such labelling requires considerable effort, and inter-annotator agreement scores are usually far below 100%. Stress marking based on phonetic transcription is an alternative, but yields even poorer quality than human annotation. Automatic labelling may help overcome these difficulties. This paper presents an automatic approach to stress detection based purely on audio, which is used to derive an automatic, layered labelling of stress events and to link them to syllables. As a proof of concept, a speech corpus was extended with the output of the stress detection algorithm, and an HMM-TTS system was trained on the extended corpus. Results are compared to a baseline system trained on the same database, but with stress marking obtained from textual transcripts by applying a set of linguistic rules. The evaluation includes CMOS tests and an analysis of the decision trees. Results show an overall improvement in the prosodic properties of the synthesized speech, and subjective ratings indicate that the voice is perceived as more natural.


Bibliographic reference. Szaszák, György / Beke, András / Olaszy, Gábor / Tóth, Bálint Pál (2015): "Using automatic stress extraction from audio for improved prosody modelling in speech synthesis", in INTERSPEECH-2015, 2227-2231.