Related papers: MRScore: Evaluating Radiology Report Generation with LLM-based Reward System

URL: http://arxiv.org/abs/2404.17778v1
Date: Sat, 27 Apr 2024 04:42:45 GMT
Title: MRScore: Evaluating Radiology Report Generation with LLM-based Reward System
Authors: Yunyi Liu, Zhanyu Wang, Yingshu Li, Xinyu Liang, Lingqiao Liu, Lei Wang, Luping Zhou
Abstract summary: This paper introduces MRScore, an automatic evaluation metric tailored for radiology report generation by leveraging Large Language Models (LLMs). To address the inadequacy of conventional NLG metrics such as BLEU, we collaborated with radiologists to develop a framework that guides LLMs for radiology report evaluation, ensuring alignment with human analysis. Our experiments demonstrate MRScore's higher correlation with human judgments and superior performance in model selection compared to traditional metrics.
Score: 39.54237580336297
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent years, automated radiology report generation has experienced significant growth. This paper introduces MRScore, an automatic evaluation metric tailored for radiology report generation by leveraging Large Language Models (LLMs). Conventional NLG (natural language generation) metrics like BLEU are inadequate for accurately assessing the generated radiology reports, as systematically demonstrated by our observations within this paper. To address this challenge, we collaborated with radiologists to develop a framework that guides LLMs for radiology report evaluation, ensuring alignment with human analysis. Our framework includes two key components: i) utilizing GPT to generate large amounts of training data, i.e., reports with different qualities, and ii) pairing GPT-generated reports as accepted and rejected samples and training LLMs to produce MRScore as the model reward. Our experiments demonstrate MRScore's higher correlation with human judgments and superior performance in model selection compared to traditional metrics. Our code and datasets will be available on GitHub.
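
The abstract describes pairing accepted and rejected reports to train a reward model, but does not state the loss function. As a rough illustration only, the sketch below uses the pairwise ranking loss that is common in reward modelling, -log(sigmoid(r_accepted - r_rejected)); this choice is an assumption and not necessarily what MRScore uses. The function name `pairwise_reward_loss` and the scalar scores are hypothetical.

```python
import math


def pairwise_reward_loss(accepted_scores, rejected_scores):
    """Mean pairwise ranking loss over (accepted, rejected) score pairs.

    Each pair contributes -log(sigmoid(r_accepted - r_rejected)), which is
    small when the accepted report scores well above the rejected one.
    NOTE: this loss is an assumed, commonly used choice, not taken from
    the MRScore paper.
    """
    if len(accepted_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    if not accepted_scores:
        raise ValueError("at least one pair is required")

    losses = []
    for accepted, rejected in zip(accepted_scores, rejected_scores):
        margin = accepted - rejected
        # -log(sigmoid(margin)) == softplus(-margin); this form avoids overflow.
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses)
```

In a real training setup the scores would come from an LLM with a scalar reward head and the loss would be minimized by gradient descent; here they are plain floats so that the ranking objective itself is easy to inspect.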

Related papers

GEMA-Score: Granular Explainable Multi-Agent Score for Radiology Report Evaluation [8.071354543390274]
We propose a Granular Explainable Multi-Agent Score (GEMA-Score) in this paper. GEMA-Score conducts both objective quantification and subjective evaluation through a large language model-based multi-agent workflow. Experiments validate that GEMA-Score achieves the highest correlation with human expert evaluations on a public dataset.
arXiv Detail & Related papers (2025-03-07T11:42:22Z)
ER2Score: LLM-based Explainable and Customizable Metric for Assessing Radiology Reports with Reward-Control Loss [39.542375803362965]
ER2Score is an automatic evaluation metric designed specifically for radiology report generation (R2Gen). It scores reports according to user-specified criteria and provides detailed sub-scores, enhancing interpretability. Our experiments demonstrate ER2Score's heightened correlation with human judgments and superior performance in model selection.
arXiv Detail & Related papers (2024-11-26T10:48:55Z)
Resource-Efficient Medical Report Generation using Large Language Models [3.2627279988912194]
Medical report generation is the task of automatically writing radiology reports for chest X-ray images. We propose a new framework leveraging vision-enabled Large Language Models (LLM) for the task of medical report generation.
arXiv Detail & Related papers (2024-10-21T05:08:18Z)
Clinical Context-aware Radiology Report Generation from Medical Images using Transformers [1.0878040851637998]
We investigate the use of the transformer model for radiology report generation from chest X-rays. We also highlight limitations in evaluating radiology report generation using only the standard language generation metrics.
arXiv Detail & Related papers (2024-08-21T05:04:25Z)
MGH Radiology Llama: A Llama 3 70B Model for Radiology [50.42811030970618]
This paper presents an advanced radiology-focused large language model: MGH Radiology Llama. It is developed using the Llama 3 70B model, building upon previous domain-specific models like Radiology-GPT and Radiology-Llama2. Our evaluation, incorporating both traditional metrics and a GPT-4-based assessment, highlights the enhanced performance of this work over general-purpose LLMs.
arXiv Detail & Related papers (2024-08-13T01:30:03Z)
X-ray Made Simple: Lay Radiology Report Generation and Robust Evaluation [22.09740244042415]
Radiology Report Generation (RRG) has advanced considerably with the development of multimodal generative models. RRG with high performance on existing lexical-based metrics might be more of a mirage - a model can get a high BLEU only by learning the template of reports. We propose a semantics-based evaluation method, which is effective in mitigating the inflated numbers of BLEU and provides more robust evaluation.
arXiv Detail & Related papers (2024-06-25T19:52:01Z)
RaTEScore: A Metric for Radiology Report Generation [59.37561810438641]
This paper introduces a novel, entity-aware metric, termed Radiological Report (Text) Evaluation (RaTEScore). RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions. Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
arXiv Detail & Related papers (2024-06-24T17:49:28Z)
LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation [37.20505633019773]
Evaluating generated radiology reports is crucial for the development of radiology AI. This study proposes a novel evaluation framework using large language models (LLMs) to compare radiology reports for assessment.
arXiv Detail & Related papers (2024-04-01T09:02:12Z)
ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations. The clinical dataset utilized in this study encompasses a remarkable total of 332,673 observations. ChatRadio-Valuer consistently outperforms state-of-the-art models, especially ChatGPT (GPT-3.5-Turbo) and GPT-4.
arXiv Detail & Related papers (2023-10-08T17:23:17Z)
Radiology-Llama2: Best-in-Class Large Language Model for Radiology [71.27700230067168]
This paper introduces Radiology-Llama2, a large language model specialized for radiology through a process known as instruction tuning. Quantitative evaluations using ROUGE metrics on the MIMIC-CXR and OpenI datasets demonstrate that Radiology-Llama2 achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-08-29T17:44:28Z)
An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians. Recent studies have achieved promising results in automatic impression generation using large-scale medical text data. These models often require substantial amounts of medical text data and have poor generalization performance.
arXiv Detail & Related papers (2023-04-17T17:13:42Z)
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs. We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
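
Several of the entries above report how well a metric correlates with human judgments, and G-Eval names Spearman correlation specifically. The sketch below shows one way such a rank correlation could be computed, using average ranks for ties. It is a generic illustration and not code from any of the listed papers; the helper names `average_ranks` and `spearman` are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    rx = average_ranks(metric_scores)
    ry = average_ranks(human_scores)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Calling `spearman(metric_scores, human_scores)` on per-report metric values and radiologist ratings gives a value in [-1, 1]; only the ordering of the scores matters, not their scale.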

This list is automatically generated from the titles and abstracts of the papers on this site.