In speech, emphasis is an important type of paralinguistic information that helps convey the focus of an utterance, new information, and emotion. If emphasis can be incorporated into a speech-to-speech (S2S) translation system, this information can be conveyed across the language barrier. However, previous related work either focuses only on the translation of particular prosodic features, such as F0, or handles emphasis but only over extremely small vocabularies, such as the 10 digits. In this paper, we describe a new S2S method that translates emphasis across languages and considers multiple features of emphasis, such as power, F0, and duration, over larger vocabularies. We do so by introducing two new components: word-level emphasis estimation using linear regression hidden semi-Markov models (HSMMs), and emphasis translation, which maps the word-level emphasis to the target language with conditional random fields. The text-to-speech synthesis system is also modified so that it can synthesize emphasized speech. The results show that our system translates emphasis correctly, with an F-measure of 91.6% in an objective test and 87.8% in a subjective test.
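To make the reported F-measure concrete, the following is a minimal sketch of how such a score could be computed over word-level emphasis decisions. The abstract does not specify the exact evaluation protocol, so the binary per-word labels (1 = emphasized, 0 = not emphasized), the balanced F1 form, and the function name `f_measure` are all assumptions made for illustration.

```python
def f_measure(reference, predicted):
    """Return (precision, recall, F1) for binary word-level emphasis labels.

    Assumption: each word carries a label of 1 (emphasized) or 0 (not
    emphasized); this is an illustrative setup, not the paper's protocol.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")

    # Count true positives, false positives, and false negatives.
    tp = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 1)
    fp = sum(1 for r, p in zip(reference, predicted) if r == 0 and p == 1)
    fn = sum(1 for r, p in zip(reference, predicted) if r == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Hypothetical six-word utterance: two words are truly emphasized,
# and the system marks three words as emphasized.
reference = [0, 1, 0, 0, 1, 0]
predicted = [0, 1, 0, 1, 1, 0]
precision, recall, f1 = f_measure(reference, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

In this toy case both emphasized words are recovered, but one unemphasized word is wrongly marked, so recall is perfect while precision is lower, and the F-measure balances the two.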
Bibliographic reference. Do, Quoc Truong / Takamichi, Shinnosuke / Sakti, Sakriani / Neubig, Graham / Toda, Tomoki / Nakamura, Satoshi (2015): "Preserving word-level emphasis in speech-to-speech translation using linear regression HSMMs", In INTERSPEECH-2015, 3665-3669.