Natural language generation (NLG) systems rely on corpora, both for
hand-crafted approaches in a traditional NLG architecture and for statistical
end-to-end (learned) generation systems. Limitations in existing resources,
however, make it difficult to develop systems which can vary the linguistic
properties of an utterance as needed. For example, when users’
attention is split between a linguistic and a secondary task such as
driving, a generation system may need to reduce the information density
of an utterance to compensate for the reduction in user attention.
We introduce a new corpus in the restaurant recommendation and
comparison domain, collected in a paraphrasing paradigm, where subjects
wrote texts targeting either a general audience or an elderly family
member. This design resulted in a corpus of more than 5000 texts which
exhibit a variety of lexical and syntactic choices and differ with
respect to average word and sentence length and to surprisal. The corpus
includes two levels of meaning representation: flat ‘semantic
stacks’ for propositional content and Rhetorical Structure Theory
(RST) relations between these propositions.
Cite as: Howcroft, D.M., Klakow, D., Demberg, V. (2017) The Extended SPaRKy Restaurant Corpus: Designing a Corpus with Variable Information Density. Proc. Interspeech 2017, 3757-3761, doi: 10.21437/Interspeech.2017-1555
@inproceedings{howcroft17_interspeech,
  author={David M. Howcroft and Dietrich Klakow and Vera Demberg},
  title={{The Extended SPaRKy Restaurant Corpus: Designing a Corpus with Variable Information Density}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3757--3761},
  doi={10.21437/Interspeech.2017-1555}
}