Towards End-to-End Spoken Dialogue Systems with Turn Embeddings

Ali Orkan Bayer, Evgeny A. Stepanov, Giuseppe Riccardi


Training task-oriented dialogue systems requires a significant amount of manual effort and the integration of many independently built components; moreover, the pipeline is prone to error propagation. End-to-end training has been proposed to overcome these problems by training the whole system over the utterances of both dialogue parties. In this paper, we present an end-to-end spoken dialogue system architecture that is based on turn embeddings. Turn embeddings encode a robust representation of user turns together with a local dialogue history, and they are trained using sequence-to-sequence models. The embeddings are trained to generate the previous and the next turns of the dialogue and, additionally, to perform spoken language understanding. The end-to-end spoken dialogue system is then trained using the pre-trained turn embeddings in a stateful architecture that considers the whole dialogue history. We observe that the proposed architecture outperforms models based on local-only dialogue history and that it is robust to automatic speech recognition errors.


DOI: 10.21437/Interspeech.2017-1574

Cite as: Bayer, A.O., Stepanov, E.A., Riccardi, G. (2017) Towards End-to-End Spoken Dialogue Systems with Turn Embeddings. Proc. Interspeech 2017, 2516-2520, DOI: 10.21437/Interspeech.2017-1574.


@inproceedings{Bayer2017,
  author={Ali Orkan Bayer and Evgeny A. Stepanov and Giuseppe Riccardi},
  title={Towards End-to-End Spoken Dialogue Systems with Turn Embeddings},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2516--2520},
  doi={10.21437/Interspeech.2017-1574},
  url={http://dx.doi.org/10.21437/Interspeech.2017-1574}
}