Clinical Text Deduplication Practices for Efficient Pretraining and
Improved Clinical Tasks
- URL: http://arxiv.org/abs/2312.09469v1
- Date: Fri, 29 Sep 2023 18:35:52 GMT
- Title: Clinical Text Deduplication Practices for Efficient Pretraining and
Improved Clinical Tasks
- Authors: Isotta Landi, Eugenia Alleva, Alissa A. Valentine, Lauren A. Lepow,
Alexander W. Charney
- Abstract summary: We provide a fine-grained characterization of duplicates stemming from common writing practices and clinical relevancy.
We demonstrate that deduplicating clinical text can help clinical LMs encode less redundant information in a more efficient manner.
- Score: 39.65514468447604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite being a unique source of information on patients' status and disease
progression, clinical notes are characterized by high levels of duplication and
information redundancy. In general domain text, it has been shown that
deduplication does not harm language model (LM) pretraining, thus helping
reduce the training cost. Although large LMs have proven to learn medical
knowledge, they still require specialized domain adaptation for improved
downstream clinical tasks. By leveraging large real-world clinical corpora, we
first provided a fine-grained characterization of duplicates stemming from
common writing practices and clinical relevancy. Second, we demonstrated that
deduplicating clinical text can help clinical LMs encode less redundant
information in a more efficient manner and do not harm classification tasks via
prompt-based learning.
Related papers
- DIRI: Adversarial Patient Reidentification with Large Language Models for Evaluating Clinical Text Anonymization [13.038800602897354]
We develop an adversarial approach using a large language model to re-identify the patient corresponding to a redacted clinical note.
Our method uses a large language model to reidentify the patient corresponding to a redacted clinical note.
Although ClinicalBERT was the most effective, masking all identified PII, our tool still reidentified 9% of clinical notes.
arXiv Detail & Related papers (2024-10-22T14:06:31Z) - Harmonising the Clinical Melody: Tuning Large Language Models for Hospital Course Summarisation in Clinical Coding [5.279406017862076]
The challenge of summarising a hospital course remains an open area for further research and development.
We adapted three pre trained LLMs, Llama 3, BioMistral, Mistral Instruct v0.1 for the hospital course summarisation task.
The fine tuned models were evaluated using BERTScore and ROUGE metrics to assess the effectiveness of clinical domain fine tuning.
arXiv Detail & Related papers (2024-09-23T00:35:23Z) - Keyword-optimized Template Insertion for Clinical Information Extraction
via Prompt-based Learning [0.2939632869678985]
We develop a keyword-optimized template insertion method (KOTI) for clinical notes.
We show how it can improve performance on several clinical tasks in a zero-shot and few-shot training setting.
arXiv Detail & Related papers (2023-10-31T00:07:11Z) - Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z) - Retrieval-Augmented and Knowledge-Grounded Language Models for Faithful Clinical Medicine [68.7814360102644]
We propose the Re$3$Writer method with retrieval-augmented generation and knowledge-grounded reasoning.
We demonstrate the effectiveness of our method in generating patient discharge instructions.
arXiv Detail & Related papers (2022-10-23T16:34:39Z) - Improving the Factual Accuracy of Abstractive Clinical Text
Summarization using Multi-Objective Optimization [3.977582258550673]
We propose a framework for improving the factual accuracy of abstractive summarization of clinical text using knowledge-guided multi-objective optimization.
In this study, we propose a framework for improving the factual accuracy of abstractive summarization of clinical text using knowledge-guided multi-objective optimization.
arXiv Detail & Related papers (2022-04-02T07:59:28Z) - Towards more patient friendly clinical notes through language models and
ontologies [57.51898902864543]
We present a novel approach to automated medical text based on word simplification and language modelling.
We use a new dataset pairs of publicly available medical sentences and a version of them simplified by clinicians.
Our method based on a language model trained on medical forum data generates simpler sentences while preserving both grammar and the original meaning.
arXiv Detail & Related papers (2021-12-23T16:11:19Z) - Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z) - Benchmarking Automated Clinical Language Simplification: Dataset,
Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.