A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models
- URL: http://arxiv.org/abs/2402.15422v2
- Date: Tue, 25 Jun 2024 17:02:10 GMT
- Title: A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models
- Authors: Stefan Hegselmann, Shannon Zejiang Shen, Florian Gierse, Monica Agrawal, David Sontag, Xiaoyi Jiang
- Abstract summary: Fine-tuning on hallucination-free data effectively reduces hallucinations from 2.60 to 1.55 per summary for Llama 2.
We find that common quantitative metrics do not correlate well with faithfulness and quality.
- Score: 11.218649399559691
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Patients often face difficulties in understanding their hospitalizations, while healthcare workers have limited resources to provide explanations. In this work, we investigate the potential of large language models to generate patient summaries based on doctors' notes and study the effect of training data on the faithfulness and quality of the generated summaries. To this end, we release (i) a rigorous labeling protocol for errors in medical texts and (ii) a publicly available dataset of annotated hallucinations in 100 doctor-written and 100 generated summaries. We show that fine-tuning on hallucination-free data effectively reduces hallucinations from 2.60 to 1.55 per summary for Llama 2, while preserving relevant information. We observe a similar effect on GPT-4 (0.70 to 0.40), when the few-shot examples are hallucination-free. We also conduct a qualitative evaluation using hallucination-free and improved training data. We find that common quantitative metrics do not correlate well with faithfulness and quality. Finally, we test GPT-4 for automatic hallucination detection, which clearly outperforms common baselines.
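The data-centric recipe described in the abstract comes down to filtering the fine-tuning corpus so that only summaries without annotated hallucinations are used as training targets. A minimal sketch of that filtering step follows; the JSONL field names and the prompt template are illustrative assumptions, not the schema of the released dataset.

```python
# Minimal sketch of filtering fine-tuning data down to hallucination-free examples.
# Field names ("doctor_notes", "summary", "hallucination_spans") are assumptions for illustration.
import json

def load_examples(path: str) -> list[dict]:
    """Load annotated (notes, summary) pairs, one JSON object per line."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def hallucination_free(examples: list[dict]) -> list[dict]:
    """Keep only summaries with zero annotated hallucination spans."""
    return [ex for ex in examples if not ex["hallucination_spans"]]

def to_instruction_pairs(examples: list[dict]) -> list[dict]:
    """Format the filtered examples as instruction-tuning pairs (e.g., for Llama 2)."""
    prompt = "Summarize the following doctor's notes for the patient:\n\n{notes}\n\nSummary:"
    return [
        {"prompt": prompt.format(notes=ex["doctor_notes"]), "completion": ex["summary"]}
        for ex in examples
    ]

if __name__ == "__main__":
    examples = load_examples("annotated_summaries.jsonl")
    clean = hallucination_free(examples)
    print(f"Kept {len(clean)}/{len(examples)} hallucination-free training examples")
    pairs = to_instruction_pairs(clean)
```

The same filter applies to few-shot prompting: the abstract's GPT-4 result comes from choosing hallucination-free in-context examples rather than from fine-tuning.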
Related papers
- MedHal: An Evaluation Dataset for Medical Hallucination Detection [2.5782420501870296]
We present MedHal, a novel large-scale dataset specifically designed to evaluate if models can detect hallucinations in medical texts.
MedHal addresses gaps by: (1) incorporating diverse medical text sources and tasks; (2) providing a substantial volume of annotated samples suitable for training medical hallucination detection models; and (3) including explanations for factual inconsistencies to guide model learning.
arXiv Detail & Related papers (2025-04-11T14:55:15Z)
- HausaNLP at SemEval-2025 Task 3: Towards a Fine-Grained Model-Aware Hallucination Detection [1.8230982862848586]
We aim to provide a nuanced, model-aware understanding of hallucination occurrences and severity in English.
We used natural language inference and fine-tuned a ModernBERT model using a synthetic dataset of 400 samples.
Results indicate a moderately positive correlation between the model's confidence scores and the actual presence of hallucinations (see the NLI-scoring sketch after this entry).
arXiv Detail & Related papers (2025-03-25T13:40:22Z)
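The NLI-based scoring used in the HausaNLP entry above (and in several of the metric papers below) can be sketched as: treat the source text as the premise, each generated sentence as the hypothesis, and flag sentences with low entailment probability. The checkpoint name, label order, and 0.5 threshold below are assumptions for illustration, not the exact systems evaluated in these papers.

```python
# Minimal NLI-based hallucination scoring sketch.
# The checkpoint and the 0.5 threshold are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "facebook/bart-large-mnli"  # an off-the-shelf MNLI model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that the premise (source) entails the hypothesis (generated sentence)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = logits.softmax(dim=-1)[0]
    return probs[2].item()  # index 2 = "entailment" for this checkpoint; verify in the model config

def flag_unsupported(source: str, summary_sentences: list[str], threshold: float = 0.5) -> list[str]:
    """Return generated sentences that the source does not appear to support."""
    return [s for s in summary_sentences if entailment_prob(source, s) < threshold]
```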
- KSHSeek: Data-Driven Approaches to Mitigating and Detecting Knowledge-Shortcut Hallucinations in Generative Models [17.435794516702256]
Large language models (LLMs) have significantly advanced the development of natural language processing (NLP).
Model hallucinations remain a major challenge in natural language generation (NLG) tasks due to their complex causes.
This work introduces a new paradigm for mitigating specific hallucination issues in generative models, enhancing their robustness and reliability in real-world applications.
arXiv Detail & Related papers (2025-03-25T09:18:27Z)
- Medical Hallucinations in Foundation Models and Their Impact on Healthcare [53.97060824532454]
Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine.
We define medical hallucination as any instance in which a model generates misleading medical content.
Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates.
These findings underscore the ethical and practical imperative for robust detection and mitigation strategies.
arXiv Detail & Related papers (2025-02-26T02:30:44Z)
- MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models [82.30696225661615]
We introduce MedHallu, the first benchmark specifically designed for medical hallucination detection.
We show that state-of-the-art LLMs, including GPT-4o, Llama-3.1, and medically fine-tuned UltraMedical, struggle with this binary hallucination detection task.
Using bidirectional entailment clustering, we show that harder-to-detect hallucinations are semantically closer to ground truth.
arXiv Detail & Related papers (2025-02-20T06:33:23Z)
- The Effects of Hallucinations in Synthetic Training Data for Relation Extraction [11.046770690972723]
We study the effects of hallucinations on the performance of relation extraction on the document and sentence levels.
Hallucinations considerably compromise the ability of models to extract relations from text.
We develop methods for the detection of hallucinations to improve data quality and model performance.
arXiv Detail & Related papers (2024-10-10T22:00:16Z)
- Generating Faithful and Complete Hospital-Course Summaries from the Electronic Health Record [3.6513957125331555]
An unintended consequence of the increased documentation burden has been reduced face-time with patients.
We propose and evaluate automated solutions for generating a summary of a patient's hospital admissions.
arXiv Detail & Related papers (2024-04-01T15:47:21Z)
- Comparing Hallucination Detection Metrics for Multilingual Generation [62.97224994631494]
This paper assesses how well various factual hallucination detection metrics identify hallucinations in generated biographical summaries across languages.
We compare how well automatic metrics correlate to each other and whether they agree with human judgments of factuality.
Our analysis reveals that while the lexical metrics are ineffective, NLI-based metrics perform well, correlating with human annotations in many settings and often outperforming supervised models.
arXiv Detail & Related papers (2024-02-16T08:10:34Z)
- Fine-grained Hallucination Detection and Editing for Language Models [109.56911670376932]
Large language models (LMs) are prone to generate factual errors, which are often called hallucinations.
We introduce a comprehensive taxonomy of hallucinations and argue that hallucinations manifest in diverse forms.
We propose a novel task of automatic fine-grained hallucination detection and construct a new evaluation benchmark, FavaBench.
arXiv Detail & Related papers (2024-01-12T19:02:48Z)
- HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data [102.56792377624927]
Hallucinations inherent in machine-generated data remain under-explored.
We present a novel hallucination detection and elimination framework, HalluciDoctor, based on the cross-checking paradigm.
Our method mitigates hallucinations by 44.6% relative while maintaining competitive performance compared to LLaVA.
arXiv Detail & Related papers (2023-11-22T04:52:58Z)
- Factored Verification: Detecting and Reducing Hallucination in Summaries of Academic Papers [1.7100359620532977]
We use Factored Verification to detect hallucinations in abstractive summaries.
We estimate how often language models hallucinate when summarizing across multiple academic papers.
The hallucinations we find are often subtle, so we advise caution when using models to synthesize academic papers (a claim-verification sketch follows this entry).
arXiv Detail & Related papers (2023-10-16T17:51:17Z)
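Factored Verification's core loop is to decompose a summary into individual claims and check each claim against the source documents. A rough sketch of that loop is below, using the OpenAI chat API; the model name and prompts are assumptions, not the paper's released prompts or scoring.

```python
# Rough sketch of factored claim verification (illustrative prompts, not the paper's).
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # assumed model name

def ask(prompt: str) -> str:
    """Single-turn chat completion with deterministic decoding."""
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return resp.choices[0].message.content.strip()

def extract_claims(summary: str) -> list[str]:
    """Split a summary into atomic factual claims, one per line."""
    out = ask(f"List each factual claim in this summary on its own line:\n\n{summary}")
    return [line.strip("- ").strip() for line in out.splitlines() if line.strip()]

def unsupported_claims(summary: str, source: str) -> list[str]:
    """Return claims that the source documents do not support."""
    flagged = []
    for claim in extract_claims(summary):
        verdict = ask(
            f"Source:\n{source}\n\nClaim: {claim}\n\n"
            "Is the claim supported by the source? Answer yes or no."
        )
        if not verdict.lower().startswith("yes"):
            flagged.append(claim)
    return flagged
```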
- Retrieval-Augmented and Knowledge-Grounded Language Models for Faithful Clinical Medicine [68.7814360102644]
We propose the Re3Writer method, which combines retrieval-augmented generation with knowledge-grounded reasoning.
We demonstrate the effectiveness of our method in generating patient discharge instructions (a minimal retrieval-augmented prompting sketch follows this entry).
arXiv Detail & Related papers (2022-10-23T16:34:39Z)
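The retrieval-augmented part of approaches like Re3Writer can be approximated by retrieving the most similar previously written instructions and grounding the prompt in them. The sketch below uses TF-IDF retrieval and an illustrative prompt template; it is a simplified stand-in, not the paper's pipeline.

```python
# Minimal retrieval-augmented prompting sketch (a simplified stand-in, not Re3Writer itself).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query_note: str, past_notes: list[str], past_instructions: list[str], k: int = 3) -> list[str]:
    """Return the discharge instructions of the k most similar past admissions."""
    vectorizer = TfidfVectorizer().fit(past_notes + [query_note])
    note_vecs = vectorizer.transform(past_notes)
    query_vec = vectorizer.transform([query_note])
    scores = cosine_similarity(query_vec, note_vecs)[0]
    top = scores.argsort()[::-1][:k]
    return [past_instructions[i] for i in top]

def build_prompt(query_note: str, retrieved: list[str]) -> str:
    """Ground the generation prompt in retrieved examples (illustrative template)."""
    examples = "\n\n".join(f"Example discharge instructions:\n{r}" for r in retrieved)
    return (
        f"{examples}\n\n"
        f"Patient notes:\n{query_note}\n\n"
        "Write faithful discharge instructions for this patient:"
    )
```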
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to a low-resource, real-world challenge: de-identification of code-mixed (Spanish-Catalan) clinical notes in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
- Adding more data does not always help: A study in medical conversation summarization with PEGASUS [5.276054618115727]
We study the effect of dataset size on transfer learning for medical conversation summarization using PEGASUS.
We also evaluate various iterative labeling strategies in the low-data regime, following their success in the classification setting.
Our work sheds light on the successes and challenges of translating low-data regime techniques in classification to medical conversation summarization.
arXiv Detail & Related papers (2021-11-15T07:27:35Z)
- Detecting Hallucinated Content in Conditional Neural Sequence Generation [165.68948078624499]
We propose a task to predict whether each token in the output sequence is hallucinated (not contained in the input).
We also introduce a method for learning to detect hallucinations using pretrained language models fine-tuned on synthetic data (a small token-tagging sketch follows this entry).
arXiv Detail & Related papers (2020-11-05T00:18:53Z)
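Token-level hallucination detection, as in the last entry, can be framed as binary token classification: fine-tune a pretrained encoder on synthetic (source, output, label) data and tag each output token as hallucinated or not. The sketch below shows only the tagging step; the checkpoint name is a placeholder, and the synthetic data construction and training loop from the paper are omitted.

```python
# Sketch of token-level hallucination tagging as binary token classification.
# "my-hallucination-tagger" is a placeholder for a checkpoint fine-tuned on synthetic data.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

CKPT = "my-hallucination-tagger"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForTokenClassification.from_pretrained(CKPT, num_labels=2)

def tag_hallucinations(source: str, output: str) -> list[tuple[str, bool]]:
    """Label each token of the generated output as hallucinated (True) or supported (False)."""
    enc = tokenizer(source, output, return_tensors="pt", truncation=True)
    with torch.no_grad():
        pred = model(**enc).logits.argmax(dim=-1)[0]
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    seq_ids = enc.sequence_ids(0)  # 0 marks source tokens, 1 marks output tokens (fast tokenizers)
    return [
        (tok, bool(label))
        for tok, label, sid in zip(tokens, pred.tolist(), seq_ids)
        if sid == 1
    ]
```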