Synthetic Imitation Edit Feedback for Factual Alignment in Clinical
Summarization
- URL: http://arxiv.org/abs/2310.20033v2
- Date: Fri, 3 Nov 2023 13:49:16 GMT
- Title: Synthetic Imitation Edit Feedback for Factual Alignment in Clinical
Summarization
- Authors: Prakamya Mishra, Zonghai Yao, Shuwei Chen, Beining Wang, Rohan Mittal,
Hong Yu
- Abstract summary: Large Language Models (LLMs) have demonstrated exceptional capabilities in capturing critical contextual information.
LLMs sometimes generate factually hallucinated summaries, which can be extremely harmful in the clinical domain.
We propose a new pipeline using ChatGPT instead of human experts to generate high-quality feedback data.
- Score: 7.765365251963273
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) like the GPT and LLaMA families have
demonstrated exceptional capabilities in capturing and condensing critical
contextual information and achieving state-of-the-art performance in the
summarization task. However, community concerns about these models'
hallucination issues continue to rise. LLMs sometimes generate factually
hallucinated summaries, which can be extremely harmful in clinical NLP tasks
(e.g., clinical note summarization), where factually incorrect statements can
lead to critically erroneous diagnoses. Fine-tuning LLMs using
human feedback has shown the promise of aligning LLMs to be factually
consistent during generation, but such a training procedure requires
high-quality human-annotated data, which can be extremely expensive to obtain
in the clinical
domain. In this work, we propose a new pipeline using ChatGPT instead of human
experts to generate high-quality feedback data for improving factual
consistency in the clinical note summarization task. We focus specifically on
edit feedback because recent work discusses the shortcomings of human alignment
via preference feedback in complex situations (such as clinical NLP tasks that
require extensive expert knowledge), as well as some advantages of collecting
edit feedback from domain experts. In addition, although GPT has reached
expert-level performance in many clinical NLP tasks (e.g., USMLE QA), little
prior work has examined whether GPT can generate expert-level edit feedback
for LMs in the clinical note summarization task. We aim to fill this gap.
Finally, our evaluations demonstrate the potential use of GPT edits in human
alignment, especially from a factuality perspective.
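To make the proposed pipeline concrete, here is a minimal sketch of the GPT-as-editor step: a draft summary from a weaker summarizer is sent to ChatGPT along with the source note, and the returned correction becomes one half of a synthetic feedback pair. This assumes an OpenAI-style chat API; the prompt wording, model name, and pairing scheme below are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: the prompt, model choice, and data pairing are
# assumptions, not the paper's actual implementation.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EDIT_PROMPT = (
    "You are a clinical documentation expert. Given a clinical note and a "
    "model-generated summary, make the minimal edits needed so that every "
    "statement in the summary is factually consistent with the note. "
    "Return only the edited summary."
)

def synthetic_edit(note: str, draft_summary: str) -> str:
    """Ask ChatGPT to act as the expert editor and return a corrected summary."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # hypothetical choice of ChatGPT variant
        messages=[
            {"role": "system", "content": EDIT_PROMPT},
            {"role": "user", "content": f"Note:\n{note}\n\nSummary:\n{draft_summary}"},
        ],
        temperature=0,  # deterministic edits for reproducible feedback data
    )
    return response.choices[0].message.content
```

Each (draft, GPT-edited) summary pair collected this way could then serve as a dispreferred/preferred example in a standard alignment objective (e.g., DPO-style preference training), which is one plausible way edit feedback stands in for expensive human annotation.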
Related papers
- The LLM Effect: Are Humans Truly Using LLMs, or Are They Being Influenced By Them Instead? [60.01746782465275]
Large Language Models (LLMs) have shown capabilities close to human performance in various analytical tasks.
This paper investigates the efficiency and accuracy of LLMs in specialized tasks through a structured user study focusing on Human-LLM partnership.
arXiv Detail & Related papers (2024-10-07T02:30:18Z)
- PALLM: Evaluating and Enhancing PALLiative Care Conversations with Large Language Models [10.258261180305439]
Large language models (LLMs) offer a new approach to assessing complex communication metrics.
LLMs offer the potential to advance the field through integration into passive sensing and just-in-time intervention systems.
This study explores LLMs as evaluators of palliative care communication quality, leveraging their linguistic, in-context learning, and reasoning capabilities.
arXiv Detail & Related papers (2024-09-23T16:39:12Z)
- Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding [53.629132242389716]
Vision-Language Models (VLM) can support clinicians by analyzing medical images and engaging in natural language interactions.
VLMs often exhibit "hallucinogenic" behavior, generating textual outputs not grounded in contextual multimodal information.
We propose a new alignment algorithm that uses symbolic representations of clinical reasoning to ground VLMs in medical knowledge.
arXiv Detail & Related papers (2024-05-29T23:19:28Z)
- Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [62.32403630651586]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.
Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.
AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z)
- SYNFAC-EDIT: Synthetic Imitation Edit Feedback for Factual Alignment in Clinical Summarization [6.130435789368263]
Large Language Models (LLMs) have demonstrated significant achievements in summarization tasks but struggle with factual inaccuracies.
To counter the high costs and limited availability of expert-annotated data for factual alignment, this study introduces an innovative pipeline.
We leverage 100B+ GPT variants to act as synthetic feedback experts offering expert-level edit feedback.
arXiv Detail & Related papers (2024-02-21T16:33:22Z)
- Context Matters: Data-Efficient Augmentation of Large Language Models for Scientific Applications [15.893290942177112]
We explore the challenges inherent to Large Language Models (LLMs) like GPT-4.
The capacity of LLMs to present erroneous answers in a coherent and semantically rigorous manner complicates the detection of factual inaccuracies.
Our work aims to enhance the understanding and mitigation of such errors, thereby contributing to the improvement of LLM accuracy and reliability.
arXiv Detail & Related papers (2023-12-12T08:43:20Z)
- Towards Mitigating Hallucination in Large Language Models via Self-Reflection [63.2543947174318]
Large language models (LLMs) have shown promise for generative and knowledge-intensive tasks including question-answering (QA) tasks.
This paper analyses the phenomenon of hallucination in medical generative QA systems using widely adopted LLMs and datasets.
arXiv Detail & Related papers (2023-10-10T03:05:44Z)
- Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization [8.456700096020601]
Large language models (LLMs) have shown promise in natural language processing (NLP), but their effectiveness on a diverse range of clinical summarization tasks remains unproven.
In this study, we apply adaptation methods to eight LLMs, spanning four distinct clinical summarization tasks.
A clinical reader study with ten physicians evaluates summary completeness, correctness, and conciseness; in a majority of cases, summaries from our best-adapted LLMs are either equivalent (45%) or superior (36%) to summaries from medical experts.
arXiv Detail & Related papers (2023-09-14T05:15:01Z)
- Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
- Does Synthetic Data Generation of LLMs Help Clinical Text Mining? [51.205078179427645]
We investigate the potential of OpenAI's ChatGPT to aid in clinical text mining.
We propose a new training paradigm that involves generating a vast quantity of high-quality synthetic data.
Our method has resulted in significant improvements in the performance of downstream tasks.
arXiv Detail & Related papers (2023-03-08T03:56:31Z)