GREEN: Generative Radiology Report Evaluation and Error Notation
- URL: http://arxiv.org/abs/2405.03595v1
- Date: Mon, 6 May 2024 16:04:03 GMT
- Title: GREEN: Generative Radiology Report Evaluation and Error Notation
- Authors: Sophie Ostmeier, Justin Xu, Zhihong Chen, Maya Varma, Louis Blankemeier, Christian Bluethgen, Arne Edward Michalson, Michael Moseley, Curtis Langlotz, Akshay S Chaudhari, Jean-Benoit Delbrouck,
- Abstract summary: Green is a radiology report generation metric that leverages the natural language understanding of language models to identify and explain clinically significant errors in candidate reports.
Compared to current metrics, GREEN offers: 1) a score aligned with expert preferences, 2) human interpretable explanations of clinically significant errors, enabling feedback loops with end-users, and 3) a lightweight open-source method that reaches the performance of commercial counterparts.
- Score: 14.31646900556454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating radiology reports is a challenging problem as factual correctness is extremely important due to the need for accurate medical communication about medical images. Existing automatic evaluation metrics either suffer from failing to consider factual correctness (e.g., BLEU and ROUGE) or are limited in their interpretability (e.g., F1CheXpert and F1RadGraph). In this paper, we introduce GREEN (Generative Radiology Report Evaluation and Error Notation), a radiology report generation metric that leverages the natural language understanding of language models to identify and explain clinically significant errors in candidate reports, both quantitatively and qualitatively. Compared to current metrics, GREEN offers: 1) a score aligned with expert preferences, 2) human interpretable explanations of clinically significant errors, enabling feedback loops with end-users, and 3) a lightweight open-source method that reaches the performance of commercial counterparts. We validate our GREEN metric by comparing it to GPT-4, as well as to error counts of 6 experts and preferences of 2 experts. Our method demonstrates not only higher correlation with expert error counts, but simultaneously higher alignment with expert preferences when compared to previous approaches."
Related papers
- GPT-4V Cannot Generate Radiology Reports Yet [25.331936045860516]
GPT-4V's purported strong multimodal abilities raise interests in using it to automate radiology report writing.
We attempt to directly generate reports using GPT-4V through different prompting strategies and find that it fails terribly in both lexical metrics and clinical efficacy metrics.
arXiv Detail & Related papers (2024-07-16T21:03:14Z) - RaTEScore: A Metric for Radiology Report Generation [59.37561810438641]
This paper introduces a novel, entity-aware metric, as Radiological Report (Text) Evaluation (RaTEScore)
RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions.
Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
arXiv Detail & Related papers (2024-06-24T17:49:28Z) - Reshaping Free-Text Radiology Notes Into Structured Reports With Generative Transformers [0.29530625605275984]
structured reporting (SR) has been recommended by various medical societies.
We propose a pipeline to extract information from free-text reports.
Our work aims to leverage the potential of Natural Language Processing (NLP) and Transformer-based models.
arXiv Detail & Related papers (2024-03-27T18:38:39Z) - Leveraging Professional Radiologists' Expertise to Enhance LLMs'
Evaluation for Radiology Reports [22.599250713630333]
Our proposed method synergizes the expertise of professional radiologists with Large Language Models (LLMs)
Our approach aligns LLM evaluations with radiologist standards, enabling detailed comparisons between human and AI generated reports.
Experimental results show that our "Detailed GPT-4 (5-shot)" model achieves a 0.48 score, outperforming the METEOR metric by 0.19.
arXiv Detail & Related papers (2024-01-29T21:24:43Z) - Comparative Experimentation of Accuracy Metrics in Automated Medical
Reporting: The Case of Otitis Consultations [0.5242869847419834]
Generative Artificial Intelligence can be used to automatically generate medical reports based on transcripts of medical consultations.
The accuracy of the generated reports needs to be established to ensure their correctness and usefulness.
There are several metrics for measuring the accuracy of AI generated reports, but little work has been done towards the application of these metrics in medical reporting.
arXiv Detail & Related papers (2023-11-22T09:51:43Z) - ChatRadio-Valuer: A Chat Large Language Model for Generalizable
Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations.
The clinical dataset utilized in this study encompasses a remarkable total of textbf332,673 observations.
ChatRadio-Valuer consistently outperforms state-of-the-art models, especially ChatGPT (GPT-3.5-Turbo) and GPT-4 et al.
arXiv Detail & Related papers (2023-10-08T17:23:17Z) - Radiology-Llama2: Best-in-Class Large Language Model for Radiology [71.27700230067168]
This paper introduces Radiology-Llama2, a large language model specialized for radiology through a process known as instruction tuning.
Quantitative evaluations using ROUGE metrics on the MIMIC-CXR and OpenI datasets demonstrate that Radiology-Llama2 achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-08-29T17:44:28Z) - Learning to diagnose cirrhosis from radiological and histological labels
with joint self and weakly-supervised pretraining strategies [62.840338941861134]
We propose to leverage transfer learning from large datasets annotated by radiologists, to predict the histological score available on a small annex dataset.
We compare different pretraining methods, namely weakly-supervised and self-supervised ones, to improve the prediction of the cirrhosis.
This method outperforms the baseline classification of the METAVIR score, reaching an AUC of 0.84 and a balanced accuracy of 0.75.
arXiv Detail & Related papers (2023-02-16T17:06:23Z) - Improving the Factual Correctness of Radiology Report Generation with
Semantic Rewards [9.175022232984709]
We propose a new method, the RadGraph reward, to further improve the factual completeness and correctness of generated radiology reports.
Our system substantially improves the scores up to 14.2% and 25.3% on metrics evaluating the factual correctness and completeness of reports.
arXiv Detail & Related papers (2022-10-21T18:27:45Z) - Mitigating Biases in Toxic Language Detection through Invariant
Rationalization [70.36701068616367]
biases toward some attributes, including gender, race, and dialect, exist in most training datasets for toxicity detection.
We propose to use invariant rationalization (InvRat), a game-theoretic framework consisting of a rationale generator and a predictor, to rule out the spurious correlation of certain syntactic patterns.
Our method yields lower false positive rate in both lexical and dialectal attributes than previous debiasing methods.
arXiv Detail & Related papers (2021-06-14T08:49:52Z) - CLARA: Clinical Report Auto-completion [56.206459591367405]
CLinicit Al it Report it Auto-completion (CLARA) is an interactive method that generates reports in a sentence by sentence fashion based on doctors' anchor words and partially completed sentences.
In our experimental evaluation, CLARA achieved 0.393 CIDEr and 0.248 BLEU-4 on X-ray reports and 0.482 CIDEr and 0.491 BLEU-4 for EEG reports for sentence-level generation.
arXiv Detail & Related papers (2020-02-26T18:45:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.