ISCA Archive Interspeech 2023

Investigating the Utility of Synthetic Data for Doctor-Patient Conversation Summarization

Siyuan Chen, Colin A. Grambow, Mojtaba Kadkhodaie Elyaderani, Alireza Sadeghi, Federico Fancellu, Thomas Schaaf

Large-scale pre-training has been a successful strategy for training transformer models. However, maintaining a large clinical dataset for pre-training is not always possible, and access to data in this domain can be time-limited and costly. We explore the use of synthetic data for pre-training sequence-to-sequence (seq-to-seq) transformer models that generate clinical notes from Doctor-Patient Conversations (DoPaCos). Using a generative language model fine-tuned on authentic conversations, we create a synthetic DoPaCo dataset and combine it with a corpus of clinical notes to pre-train a Longformer-Encoder-Decoder (LED) model. Results show that pre-training on synthetic data yields performance on the downstream summarization task comparable to pre-training on authentic data. Pre-training on synthetic conversations first, followed by clinical notes, yields higher performance across most of our evaluation metrics.
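For readers unfamiliar with the setup the abstract describes, the sketch below shows one way a seq-to-seq LED model could be trained on conversation-to-note pairs using Hugging Face Transformers. It is a minimal illustration, not the authors' code: the checkpoint name (allenai/led-base-16384), the toy conversation/note pair, and the hyperparameters are all assumed stand-ins.

import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

# Assumed public LED checkpoint; the paper's actual base model may differ.
model_name = "allenai/led-base-16384"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)

# One hypothetical synthetic conversation/note pair standing in for a corpus.
conversation = ("Doctor: What brings you in today? "
                "Patient: I've had a cough for about two weeks.")
note = "Chief complaint: cough, two weeks' duration."

inputs = tokenizer(conversation, return_tensors="pt", truncation=True)
labels = tokenizer(text_target=note, return_tensors="pt", truncation=True).input_ids

# LED uses sparse attention; the usual convention is to give the first
# token global attention so the encoder has at least one global anchor.
global_attention_mask = torch.zeros_like(inputs.input_ids)
global_attention_mask[:, 0] = 1

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
optimizer.zero_grad()
outputs = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    global_attention_mask=global_attention_mask,
    labels=labels,  # cross-entropy loss computed against the target note
)
outputs.loss.backward()  # single illustrative gradient step
optimizer.step()

In the staged regime the abstract reports as best, a loop like this would first run over synthetic DoPaCo pairs, then continue on clinical notes, before fine-tuning on the authentic summarization data.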


doi: 10.21437/Interspeech.2023-913

Cite as: Chen, S., Grambow, C.A., Kadkhodaie Elyaderani, M., Sadeghi, A., Fancellu, F., Schaaf, T. (2023) Investigating the Utility of Synthetic Data for Doctor-Patient Conversation Summarization. Proc. INTERSPEECH 2023, 2338-2342, doi: 10.21437/Interspeech.2023-913

@inproceedings{chen23i_interspeech,
  author={Siyuan Chen and Colin A. Grambow and Mojtaba {Kadkhodaie Elyaderani} and Alireza Sadeghi and Federico Fancellu and Thomas Schaaf},
  title={{Investigating the Utility of Synthetic Data for Doctor-Patient Conversation Summarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={2338--2342},
  doi={10.21437/Interspeech.2023-913}
}