A Hybrid System for Continuous Word-Level Emphasis Modeling Based on HMM State Clustering and Adaptive Training

Quoc Truong Do, Tomoki Toda, Graham Neubig, Sakriani Sakti, Satoshi Nakamura


Emphasis is an important aspect of speech that conveys the focus of an utterance, and emphasis modeling has been an active research field. Previous work has modeled emphasis using state clustering with an emphasis contextual factor that indicates whether or not a word is emphasized. In addition, cluster adaptive training (CAT) makes it possible to directly optimize model parameters for clusters with different characteristics. In this paper, we first straightforwardly extend CAT to emphasis adaptive training using continuous emphasis representations. We then compare it to state clustering and propose a hybrid approach that combines the emphasis contextual factor with adaptive training. Experiments demonstrated that adaptive training is effective both on its own and when combined with the state clustering approach (the hybrid system), improving emphasis estimation by 2–5% F-measure and producing more natural audio.
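To make the idea of a continuous emphasis representation concrete, the sketch below shows the general CAT-style interpolation, in which a Gaussian mean is formed as a weighted sum of cluster means. It is only an illustration under stated assumptions: the two-cluster setup (a "normal" and an "emphasized" cluster), the weighting `(1 - level, level)`, and the function names `cat_mean` and `emphasis_mean_vector` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def cat_mean(cluster_means: Sequence[Sequence[float]],
             weights: Sequence[float]) -> List[float]:
    """Return the weighted sum of cluster mean vectors (CAT-style interpolation)."""
    if len(cluster_means) != len(weights):
        raise ValueError("need exactly one weight per cluster")
    dim = len(cluster_means[0])
    if any(len(m) != dim for m in cluster_means):
        raise ValueError("all cluster means must have the same dimension")
    return [sum(w * m[d] for w, m in zip(weights, cluster_means))
            for d in range(dim)]


def emphasis_mean_vector(normal_mean: Sequence[float],
                         emphasized_mean: Sequence[float],
                         level: float) -> List[float]:
    """Interpolate between two clusters with a continuous emphasis level.

    Assumption (not from the paper): level 0.0 selects the normal cluster,
    level 1.0 selects the emphasized cluster, and values in between blend them.
    """
    return cat_mean([normal_mean, emphasized_mean], [1.0 - level, level])


if __name__ == "__main__":
    # A word with emphasis level 0.5 lands halfway between the two clusters.
    print(emphasis_mean_vector([1.0, 2.0], [3.0, 6.0], 0.5))
```

Under this assumed setup, a binary emphasis label corresponds to the special cases `level = 0.0` and `level = 1.0`, while intermediate values represent graded, word-level emphasis.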


DOI: 10.21437/Interspeech.2016-930

Cite as

Do, Q.T., Toda, T., Neubig, G., Sakti, S., Nakamura, S. (2016) A Hybrid System for Continuous Word-Level Emphasis Modeling Based on HMM State Clustering and Adaptive Training. Proc. Interspeech 2016, 3196-3200.

Bibtex
@inproceedings{Do+2016,
author={Quoc Truong Do and Tomoki Toda and Graham Neubig and Sakriani Sakti and Satoshi Nakamura},
title={A Hybrid System for Continuous Word-Level Emphasis Modeling Based on HMM State Clustering and Adaptive Training},
year=2016,
booktitle={Interspeech 2016},
doi={10.21437/Interspeech.2016-930},
url={http://dx.doi.org/10.21437/Interspeech.2016-930},
pages={3196--3200}
}