Large-scale pre-training has been a successful strategy for training transformer models. However, maintaining a large clinical dataset for pre-training is not always possible, and access to data in this domain can be time-limited and costly. We explore using synthetic data to pre-train sequence-to-sequence (seq-to-seq) transformer models that generate clinical notes from doctor-patient conversations (DoPaCos). Using a generative language model fine-tuned on authentic conversations, we create a synthetic DoPaCo dataset and use it, together with a corpus of clinical notes, to pre-train a Longformer-Encoder-Decoder (LED) model. Results show that pre-training with synthetic data yields downstream summarization performance comparable to pre-training with authentic data. Pre-training on synthetic conversations first, followed by clinical notes, yields higher performance across most of our evaluation metrics.
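As a rough illustration of the downstream task described above (not the authors' implementation), the sketch below shows how an LED seq-to-seq model could be trained to map a conversation transcript to a clinical note using the Hugging Face Transformers LED API; the checkpoint name, sequence lengths, and conversation/note texts are illustrative placeholders.

import torch
from transformers import LEDForConditionalGeneration, LEDTokenizer

# Hypothetical base checkpoint; the paper pre-trains its own LED model.
model_name = "allenai/led-base-16384"
tokenizer = LEDTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)

# Toy conversation/note pair standing in for a real (DoPaCo, clinical note) example.
conversation = "Doctor: What brings you in today? Patient: I've had a cough for about a week."
note = "Chief complaint: cough for one week."

inputs = tokenizer(conversation, max_length=4096, truncation=True, return_tensors="pt")
labels = tokenizer(note, max_length=512, truncation=True, return_tensors="pt").input_ids

# LED expects global attention on at least the first input token for seq-to-seq tasks.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    labels=labels,
)
outputs.loss.backward()  # standard cross-entropy over the note tokens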
Cite as: Chen, S., Grambow, C.A., Kadkhodaie Elyaderani, M., Sadeghi, A., Fancellu, F., Schaaf, T. (2023) Investigating the Utility of Synthetic Data for Doctor-Patient Conversation Summarization. Proc. INTERSPEECH 2023, 2338-2342, doi: 10.21437/Interspeech.2023-913
@inproceedings{chen23i_interspeech,
  author={Siyuan Chen and Colin A. Grambow and Mojtaba {Kadkhodaie Elyaderani} and Alireza Sadeghi and Federico Fancellu and Thomas Schaaf},
  title={{Investigating the Utility of Synthetic Data for Doctor-Patient Conversation Summarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
  pages={2338--2342},
  doi={10.21437/Interspeech.2023-913}
}