MRScore: Evaluating Radiology Report Generation with LLM-based Reward System
- URL: http://arxiv.org/abs/2404.17778v1
- Date: Sat, 27 Apr 2024 04:42:45 GMT
- Title: MRScore: Evaluating Radiology Report Generation with LLM-based Reward System
- Authors: Yunyi Liu, Zhanyu Wang, Yingshu Li, Xinyu Liang, Lingqiao Liu, Lei Wang, Luping Zhou,
- Abstract summary: This paper introduces MRScore, an automatic evaluation metric tailored for radiology report generation by leveraging Large Language Models (LLMs)
To address this challenge, we collaborated with radiologists to develop a framework that guides LLMs for radiology report evaluation, ensuring alignment with human analysis.
Our experiments demonstrate MRScore's higher correlation with human judgments and superior performance in model selection compared to traditional metrics.
- Score: 39.54237580336297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, automated radiology report generation has experienced significant growth. This paper introduces MRScore, an automatic evaluation metric tailored for radiology report generation by leveraging Large Language Models (LLMs). Conventional NLG (natural language generation) metrics like BLEU are inadequate for accurately assessing the generated radiology reports, as systematically demonstrated by our observations within this paper. To address this challenge, we collaborated with radiologists to develop a framework that guides LLMs for radiology report evaluation, ensuring alignment with human analysis. Our framework includes two key components: i) utilizing GPT to generate large amounts of training data, i.e., reports with different qualities, and ii) pairing GPT-generated reports as accepted and rejected samples and training LLMs to produce MRScore as the model reward. Our experiments demonstrate MRScore's higher correlation with human judgments and superior performance in model selection compared to traditional metrics. Our code and datasets will be available on GitHub.
Related papers
- X-ray Made Simple: Radiology Report Generation and Evaluation with Layman's Terms [25.871814979179373]
Radiology Report Generation (RRG) has achieved significant progress with the advancements of multimodal generative models.
High performance on RRG with existing lexical-based metrics (e.g. BLEU) might be more of a mirage - a model can get a high BLEU only by learning the template of reports.
We un-intuitively approach this problem by proposing the Layman's RRG framework, a layman's terms-based dataset, evaluation and training framework.
arXiv Detail & Related papers (2024-06-25T19:52:01Z) - RaTEScore: A Metric for Radiology Report Generation [59.37561810438641]
This paper introduces a novel, entity-aware metric, as Radiological Report (Text) Evaluation (RaTEScore)
RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions.
Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
arXiv Detail & Related papers (2024-06-24T17:49:28Z) - LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation [37.20505633019773]
evaluating generated radiology reports is crucial for the development of radiology AI.
This study proposes a novel evaluation framework using large language models (LLMs) to compare radiology reports for assessment.
arXiv Detail & Related papers (2024-04-01T09:02:12Z) - Radiology-Aware Model-Based Evaluation Metric for Report Generation [5.168471027680258]
We propose a new automated evaluation metric for machine-generated radiology reports using the successful COMET architecture adapted for the radiology domain.
We train and publish four medically-oriented model checkpoints, including one trained on RadGraph, a radiology knowledge graph.
Our results show that our metric correlates moderately to high with established metrics such as BERTscore, BLEU, and CheXbert scores.
arXiv Detail & Related papers (2023-11-28T13:08:26Z) - ChatRadio-Valuer: A Chat Large Language Model for Generalizable
Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations.
The clinical dataset utilized in this study encompasses a remarkable total of textbf332,673 observations.
ChatRadio-Valuer consistently outperforms state-of-the-art models, especially ChatGPT (GPT-3.5-Turbo) and GPT-4 et al.
arXiv Detail & Related papers (2023-10-08T17:23:17Z) - Radiology-Llama2: Best-in-Class Large Language Model for Radiology [71.27700230067168]
This paper introduces Radiology-Llama2, a large language model specialized for radiology through a process known as instruction tuning.
Quantitative evaluations using ROUGE metrics on the MIMIC-CXR and OpenI datasets demonstrate that Radiology-Llama2 achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-08-29T17:44:28Z) - An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians.
Recent studies have achieved promising results in automatic impression generation using large-scale medical text data.
These models often require substantial amounts of medical text data and have poor generalization performance.
arXiv Detail & Related papers (2023-04-17T17:13:42Z) - G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human on summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z) - Improving Factual Completeness and Consistency of Image-to-Text
Radiology Report Generation [26.846912996765447]
We introduce two new simple rewards to encourage the generation of factually complete and consistent radiology reports.
We show that our system leads to generations that are more factually complete and consistent compared to the baselines.
arXiv Detail & Related papers (2020-10-20T05:42:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.