15th Annual Conference of the International Speech Communication Association

September 14-18, 2014

Dialogue Context Sensitive Speech Synthesis Using Factorized Decision Trees

Pirros Tsiakoulis, Catherine Breslin, M. Gašić, Matthew Henderson, Dongho Kim, Steve Young

University of Cambridge, UK

This paper extends our recent work on rich context utilization for expressive speech synthesis in spoken dialogue systems, in which significant improvements to the appropriateness of HMM-based synthetic voices were achieved by introducing dialogue context into the decision tree state clustering stage. Continuing in this direction, this paper investigates the performance of dialogue context-sensitive voices in different domains. The Context Adaptive Training with Factorized Decision Trees (FD-CAT) approach was used to train a dialogue context-sensitive synthetic voice, which was then compared to a baseline system using the standard decision tree approach. Preference-based listening tests were conducted for two different domains. The first domain concerned restaurant information and had significant coverage in the training data, while the second, dealing with appointment bookings, had minimal coverage in the training data. No significant preference was found for any of the voices when tested in the restaurant domain, whereas in the appointment booking domain listeners showed a statistically significant preference for the adaptively trained voice.

Bibliographic reference. Tsiakoulis, Pirros / Breslin, Catherine / Gašić, M. / Henderson, Matthew / Kim, Dongho / Young, Steve (2014): "Dialogue context sensitive speech synthesis using factorized decision trees", In INTERSPEECH-2014, 2937-2941.