ReXamine-Global: A Framework for Uncovering Inconsistencies in Radiology Report Generation Metrics
- URL: http://arxiv.org/abs/2408.16208v1
- Date: Thu, 29 Aug 2024 02:03:05 GMT
- Title: ReXamine-Global: A Framework for Uncovering Inconsistencies in Radiology Report Generation Metrics
- Authors: Oishi Banerjee, Agustina Saenz, Kay Wu, Warren Clements, Adil Zia, Dominic Buensalido, Helen Kavnoudias, Alain S. Abi-Ghanem, Nour El Ghawi, Cibele Luna, Patricia Castillo, Khaled Al-Surimi, Rayyan A. Daghistani, Yuh-Min Chen, Heng-sheng Chao, Lars Heiliger, Moon Kim, Johannes Haubold, Frederic Jonske, Pranav Rajpurkar
- Abstract summary: ReXamine-Global is a framework that tests metrics across different writing styles and patient populations.
We apply ReXamine-Global to 7 established report evaluation metrics and uncover serious gaps in their generalizability.
- Score: 3.028298624225796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the rapidly expanding capabilities of generative AI models for radiology, there is a need for robust metrics that can accurately measure the quality of AI-generated radiology reports across diverse hospitals. We develop ReXamine-Global, an LLM-powered, multi-site framework that tests metrics across different writing styles and patient populations, exposing gaps in their generalization. First, our method tests whether a metric is undesirably sensitive to reporting style, providing different scores depending on whether AI-generated reports are stylistically similar to ground-truth reports or not. Second, our method measures whether a metric reliably agrees with experts, or whether metric and expert scores of AI-generated report quality diverge for some sites. Using 240 reports from 6 hospitals around the world, we apply ReXamine-Global to 7 established report evaluation metrics and uncover serious gaps in their generalizability. Developers can apply ReXamine-Global when designing new report evaluation metrics, ensuring their robustness across sites. Additionally, our analysis of existing metrics can guide users of those metrics towards evaluation procedures that work reliably at their sites of interest.
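The two checks described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes Pearson correlation as the agreement measure and a simple difference of means as the style-sensitivity measure, neither of which the abstract specifies. It also assumes expert scores where higher means better. The function names (`pearson`, `style_gap`, `agreement_by_site`), the site labels, and all numbers are hypothetical and made up for illustration.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (assumed agreement measure)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def style_gap(matched_scores, mismatched_scores):
    """Mean metric score for style-matched reports minus style-mismatched ones.

    If both report versions carry the same clinical content, a style-robust
    metric should give a gap near zero.
    """
    return mean(matched_scores) - mean(mismatched_scores)


def agreement_by_site(records):
    """Group (site, metric_score, expert_score) triples and correlate per site."""
    by_site = defaultdict(lambda: ([], []))
    for site, metric_score, expert_score in records:
        by_site[site][0].append(metric_score)
        by_site[site][1].append(expert_score)
    return {site: pearson(m, e) for site, (m, e) in by_site.items()}


# Made-up toy data, for illustration only (not results from the paper).
records = [
    ("site_a", 0.2, 1), ("site_a", 0.5, 2), ("site_a", 0.8, 3),
    ("site_b", 0.9, 1), ("site_b", 0.5, 2), ("site_b", 0.1, 3),
]
agreement = agreement_by_site(records)
gap = style_gap([0.8, 0.7, 0.9], [0.5, 0.4, 0.6])
print(agreement, gap)
```

In this toy data the metric tracks the expert scores at `site_a` and moves against them at `site_b`, which is the kind of per-site divergence the second check is meant to surface. A clearly nonzero `gap` would indicate sensitivity to reporting style.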
Related papers
- RaTEScore: A Metric for Radiology Report Generation [59.37561810438641]
This paper introduces a novel, entity-aware metric, termed Radiological Report (Text) Evaluation (RaTEScore).
RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions.
Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
arXiv Detail & Related papers (2024-06-24T17:49:28Z)
- MRScore: Evaluating Radiology Report Generation with LLM-based Reward System [39.54237580336297]
This paper introduces MRScore, an automatic evaluation metric tailored for radiology report generation by leveraging Large Language Models (LLMs).
To address this challenge, we collaborated with radiologists to develop a framework that guides LLMs for radiology report evaluation, ensuring alignment with human analysis.
Our experiments demonstrate MRScore's higher correlation with human judgments and superior performance in model selection compared to traditional metrics.
arXiv Detail & Related papers (2024-04-27T04:42:45Z)
- Leveraging Professional Radiologists' Expertise to Enhance LLMs' Evaluation for Radiology Reports [22.599250713630333]
Our proposed method synergizes the expertise of professional radiologists with Large Language Models (LLMs).
Our approach aligns LLM evaluations with radiologist standards, enabling detailed comparisons between human and AI generated reports.
Experimental results show that our "Detailed GPT-4 (5-shot)" model achieves a 0.48 score, outperforming the METEOR metric by 0.19.
arXiv Detail & Related papers (2024-01-29T21:24:43Z)
- Radiology-Aware Model-Based Evaluation Metric for Report Generation [5.168471027680258]
We propose a new automated evaluation metric for machine-generated radiology reports using the successful COMET architecture adapted for the radiology domain.
We train and publish four medically-oriented model checkpoints, including one trained on RadGraph, a radiology knowledge graph.
Our results show that our metric has moderate to high correlation with established metrics such as BERTScore, BLEU, and CheXbert scores.
arXiv Detail & Related papers (2023-11-28T13:08:26Z)
- Comparative Experimentation of Accuracy Metrics in Automated Medical Reporting: The Case of Otitis Consultations [0.5242869847419834]
Generative Artificial Intelligence can be used to automatically generate medical reports based on transcripts of medical consultations.
The accuracy of the generated reports needs to be established to ensure their correctness and usefulness.
There are several metrics for measuring the accuracy of AI generated reports, but little work has been done towards the application of these metrics in medical reporting.
arXiv Detail & Related papers (2023-11-22T09:51:43Z)
- ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations.
The clinical dataset utilized in this study encompasses a remarkable total of 332,673 observations.
ChatRadio-Valuer consistently outperforms state-of-the-art models, especially ChatGPT (GPT-3.5-Turbo) and GPT-4.
arXiv Detail & Related papers (2023-10-08T17:23:17Z)
- What is the Best Automated Metric for Text to Motion Generation? [19.71712698183703]
There is growing interest in generating skeleton-based human motions from natural language descriptions.
Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments.
This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better.
arXiv Detail & Related papers (2023-09-19T01:59:54Z)
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BERTScore.
arXiv Detail & Related papers (2022-04-01T14:04:16Z)
- Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand [117.62186420147563]
We propose a generalization of leaderboards, bidimensional leaderboards (Billboards).
Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries.
We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation.
arXiv Detail & Related papers (2021-12-08T06:34:58Z)
- OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics [53.779709191191685]
We propose OpenMEVA, a benchmark for evaluating open-ended story generation metrics.
OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics.
We observe that existing metrics have poor correlation with human judgments, fail to recognize discourse-level incoherence, and lack inferential knowledge.
arXiv Detail & Related papers (2021-05-19T04:45:07Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.