Rephrasing Electronic Health Records for Pretraining Clinical Language Models
- URL: http://arxiv.org/abs/2411.18940v1
- Date: Thu, 28 Nov 2024 06:12:28 GMT
- Title: Rephrasing Electronic Health Records for Pretraining Clinical Language Models
- Authors: Jinghui Liu, Anthony Nguyen,
- Abstract summary: We rephrase existing clinical notes using LLMs to generate synthetic pretraining corpora.
We find that augmenting original clinical notes with synthetic corpora from different LLMs improves performances even at a small token budget.
- Score: 0.09819964822292428
- License:
- Abstract: Clinical language models are important for many applications in healthcare, but their development depends on access to extensive clinical text for pretraining. However, obtaining clinical notes from electronic health records (EHRs) at scale is challenging due to patient privacy concerns. In this study, we rephrase existing clinical notes using LLMs to generate synthetic pretraining corpora, drawing inspiration from previous work on rephrasing web data. We examine four popular small-sized LLMs (<10B) to create synthetic clinical text to pretrain both decoder-based and encoder-based language models. The method yields better results in language modeling and downstream tasks than previous synthesis approaches without referencing real clinical text. We find that augmenting original clinical notes with synthetic corpora from different LLMs improves performances even at a small token budget, showing the potential of this method to support pretraining at the institutional level or be scaled to synthesize large-scale clinical corpora.
Related papers
- Retrieval-Reasoning Large Language Model-based Synthetic Clinical Trial Generation [8.567656208979475]
We introduce a novel Retrieval-Reasoning framework that leverages large language models to generate synthetic clinical trials.
Experiments conducted on real clinical trials from the urlClinicalTrials.gov database demonstrate that our synthetic data can effectively augment real datasets.
Our findings suggest that LLMs for synthetic clinical trial generation hold promise for accelerating clinical research and upholding ethical standards for patient privacy.
arXiv Detail & Related papers (2024-10-16T11:46:32Z) - Harmonising the Clinical Melody: Tuning Large Language Models for Hospital Course Summarisation in Clinical Coding [5.279406017862076]
The challenge of summarising a hospital course remains an open area for further research and development.
We adapted three pre trained LLMs, Llama 3, BioMistral, Mistral Instruct v0.1 for the hospital course summarisation task.
The fine tuned models were evaluated using BERTScore and ROUGE metrics to assess the effectiveness of clinical domain fine tuning.
arXiv Detail & Related papers (2024-09-23T00:35:23Z) - Synthetic4Health: Generating Annotated Synthetic Clinical Letters [6.822926897514792]
Since clinical letters contain sensitive information, clinical-related datasets can not be widely applied in model training, medical research, and teaching.
This work aims to generate reliable, various, and de-identified synthetic clinical letters.
arXiv Detail & Related papers (2024-09-14T18:15:07Z) - Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
We build a benchmark ClinicBench to better understand large language models (LLMs) in the clinic.
We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks.
We then construct six novel datasets and clinical tasks that are complex but common in real-world practice.
We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
arXiv Detail & Related papers (2024-04-25T15:51:06Z) - Dynamic Q&A of Clinical Documents with Large Language Models [3.021316686584699]
This work introduces a natural language interface using large language models (LLMs) for dynamic question-answering on clinical notes.
Experiments, utilizing various embedding models and advanced LLMs, show Wizard Vicuna's superior accuracy, albeit with high compute demands.
arXiv Detail & Related papers (2024-01-19T14:50:22Z) - Do We Still Need Clinical Language Models? [15.023633270864675]
We show that relatively small specialized clinical models substantially outperform all in-context learning approaches.
We release the code and the models used under the PhysioNet Credentialed Health Data license and data use agreement.
arXiv Detail & Related papers (2023-02-16T05:08:34Z) - Cross-Lingual Knowledge Transfer for Clinical Phenotyping [55.92262310716537]
We investigate cross-lingual knowledge transfer strategies to execute this task for clinics that do not use the English language.
We evaluate these strategies for a Greek and a Spanish clinic leveraging clinical notes from different clinical domains.
Our results show that using multilingual data overall improves clinical phenotyping models and can compensate for data sparseness.
arXiv Detail & Related papers (2022-08-03T08:33:21Z) - Towards more patient friendly clinical notes through language models and
ontologies [57.51898902864543]
We present a novel approach to automated medical text based on word simplification and language modelling.
We use a new dataset pairs of publicly available medical sentences and a version of them simplified by clinicians.
Our method based on a language model trained on medical forum data generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z) - Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z) - Benchmarking Automated Clinical Language Simplification: Dataset,
Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z) - An Interpretable End-to-end Fine-tuning Approach for Long Clinical Text [72.62848911347466]
Unstructured clinical text in EHRs contains crucial information for applications including decision support, trial matching, and retrospective research.
Recent work has applied BERT-based models to clinical information extraction and text classification, given these models' state-of-the-art performance in other NLP domains.
In this work, we propose a novel fine-tuning approach called SnipBERT. Instead of using entire notes, SnipBERT identifies crucial snippets and feeds them into a truncated BERT-based model in a hierarchical manner.
arXiv Detail & Related papers (2020-11-12T17:14:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.