Ninth International Conference on Spoken Language Processing

Pittsburgh, PA, USA
September 17-21, 2006

User Simulation for Spoken Dialogue Systems: Learning and Evaluation

Kallirroi Georgila, James Henderson, Oliver Lemon

University of Edinburgh, UK

We propose "advanced" n-grams as a new technique for simulating user behaviour in spoken dialogue systems, and we compare them with two methods used in our prior work: linear feature combination and "normal" n-grams. All methods operate at the intention level and can incorporate speech recognition and understanding errors. In the linear feature combination model, user actions (lists of { speech act, task } pairs) are selected based on features of the current dialogue state, which encodes the whole history of the dialogue. The user simulation based on "normal" n-grams treats a dialogue as a sequence of lists of { speech act, task } pairs; here the length of the history considered is restricted by the order of the n-gram. The "advanced" n-grams are a variation of the normal n-grams in which user actions are conditioned not only on speech acts and tasks but also on the current status of the tasks, i.e. whether the information needed by the application (in our case flight booking) has been provided and confirmed by the user. This captures elements of goal-directed user behaviour. All models were trained and evaluated on the COMMUNICATOR corpus, to which we added annotations for user actions and dialogue context. We evaluate how closely the synthetic responses resemble real user responses by comparing the response generated by each user simulation model in a given dialogue context (taken from the annotated corpus) with the actual user response. We propose expected accuracy, expected precision, and expected recall as evaluation metrics, in place of the standard precision and recall used in prior work, and we discuss why they are more appropriate for evaluating user simulation models than their standard counterparts. The advanced n-grams produce higher scores than the normal n-grams for small values of n, which shows their strength when too little data is available to train larger n-grams. The linear model produces the best expected accuracy, but with respect to expected precision and expected recall it is outperformed by the large n-grams, even though it is trained using more information. As a task-based evaluation, we also run each of the user simulation models against a system policy trained on the same corpus. Here the linear feature combination model outperforms the other methods, and the advanced n-grams outperform the normal n-grams for all values of n, which again shows their potential. We also calculate the perplexity of the different user models.
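
The difference between "normal" and "advanced" n-grams described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name, the turn format (speaker, action, status), the task-status encoding, and the toy data are all hypothetical, and the sketch uses plain relative-frequency estimates with no smoothing or back-off. An action is a tuple of (speech act, task) pairs; the only difference between the two variants is whether the task status is part of the conditioning context.

```python
import random
from collections import Counter, defaultdict


class NgramUserSimulator:
    """Toy n-gram user model over dialogue actions (illustrative only)."""

    def __init__(self, n, use_status=False):
        self.n = n                    # n-gram order: n - 1 previous actions form the history
        self.use_status = use_status  # True -> "advanced" variant (assumed encoding)
        self.counts = defaultdict(Counter)

    def _context(self, history, status):
        # Keep only the last n - 1 actions; unigrams use an empty history.
        recent = tuple(history[-(self.n - 1):]) if self.n > 1 else ()
        return (recent, status if self.use_status else None)

    def train(self, dialogues):
        # Each dialogue is a list of (speaker, action, status) turns.
        for turns in dialogues:
            history = []
            for speaker, action, status in turns:
                if speaker == "user":
                    self.counts[self._context(history, status)][action] += 1
                history.append(action)

    def distribution(self, history, status=None):
        # Relative-frequency estimate of P(user action | context); {} if unseen.
        counter = self.counts.get(self._context(history, status))
        if not counter:
            return {}
        total = sum(counter.values())
        return {action: count / total for action, count in counter.items()}

    def sample(self, history, status=None, rng=random):
        # Draw one user action from the estimated distribution, or None if unseen.
        dist = self.distribution(history, status)
        if not dist:
            return None
        actions = list(dist)
        return rng.choices(actions, weights=[dist[a] for a in actions])[0]


# Hypothetical toy data: actions are tuples of (speech_act, task) pairs.
ASK_DEST = (("request_info", "dest_city"),)
GIVE_DEST = (("provide_info", "dest_city"),)
GIVE_DATE = (("provide_info", "depart_date"),)
S_EMPTY = (("dest_city", "empty"),)    # destination not yet provided
S_FILLED = (("dest_city", "filled"),)  # destination already provided

toy_dialogues = [
    [("system", ASK_DEST, S_EMPTY), ("user", GIVE_DEST, S_EMPTY)],
    [("system", ASK_DEST, S_EMPTY), ("user", GIVE_DEST, S_EMPTY)],
    [("system", ASK_DEST, S_FILLED), ("user", GIVE_DATE, S_FILLED)],
]

normal = NgramUserSimulator(n=2)
normal.train(toy_dialogues)
advanced = NgramUserSimulator(n=2, use_status=True)
advanced.train(toy_dialogues)

print(normal.distribution([ASK_DEST]))
print(advanced.distribution([ASK_DEST], S_EMPTY))
print(advanced.distribution([ASK_DEST], S_FILLED))
```

In this toy setting the normal bigram model mixes the two user responses to the same system prompt, while the status-conditioned variant separates them according to whether the destination has already been provided, which is the kind of goal-directed behaviour the abstract refers to.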

Bibliographic reference. Georgila, Kallirroi / Henderson, James / Lemon, Oliver (2006): "User simulation for spoken dialogue systems: learning and evaluation", in INTERSPEECH-2006, paper 2035-Tue2A3O.6.