Leveraging Professional Radiologists' Expertise to Enhance LLMs'
Evaluation for Radiology Reports
- URL: http://arxiv.org/abs/2401.16578v3
- Date: Sat, 17 Feb 2024 03:07:00 GMT
- Title: Leveraging Professional Radiologists' Expertise to Enhance LLMs'
Evaluation for Radiology Reports
- Authors: Qingqing Zhu, Xiuying Chen, Qiao Jin, Benjamin Hou, Tejas Sudharshan
Mathai, Pritam Mukherjee, Xin Gao, Ronald M Summers, Zhiyong Lu
- Abstract summary: Our proposed method synergizes the expertise of professional radiologists with Large Language Models (LLMs)
Our approach aligns LLM evaluations with radiologist standards, enabling detailed comparisons between human and AI generated reports.
Experimental results show that our "Detailed GPT-4 (5-shot)" model achieves a 0.48 score, outperforming the METEOR metric by 0.19.
- Score: 22.599250713630333
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In radiology, Artificial Intelligence (AI) has significantly advanced report
generation, but automatic evaluation of these AI-produced reports remains
challenging. Current metrics, such as Conventional Natural Language Generation
(NLG) and Clinical Efficacy (CE), often fall short in capturing the semantic
intricacies of clinical contexts or overemphasize clinical details, undermining
report clarity. To overcome these issues, our proposed method synergizes the
expertise of professional radiologists with Large Language Models (LLMs), like
GPT-3.5 and GPT-4. Utilizing In-Context Instruction Learning (ICIL) and Chain
of Thought (CoT) reasoning, our approach aligns LLM evaluations with
radiologist standards, enabling detailed comparisons between human and AI
generated reports. This is further enhanced by a Regression model that
aggregates sentence evaluation scores. Experimental results show that our
"Detailed GPT-4 (5-shot)" model achieves a 0.48 score, outperforming the METEOR
metric by 0.19, while our "Regressed GPT-4" model shows even greater alignment
with expert evaluations, exceeding the best existing metric by a 0.35 margin.
Moreover, the robustness of our explanations has been validated through a
thorough iterative strategy. We plan to publicly release annotations from
radiology experts, setting a new standard for accuracy in future assessments.
This underscores the potential of our approach in enhancing the quality
assessment of AI-driven medical reports.
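As a rough illustration of the pipeline the abstract describes, the sketch below prompts GPT-4 with a small set of in-context examples and a chain-of-thought instruction to score each generated sentence against the reference, then fits a simple regression model over aggregated sentence scores. This is not the authors' released code: the prompt wording, the 0-5 scoring scale, the summary features fed to the regressor, and the toy data are all assumptions.

```python
# Minimal sketch of (1) ICIL + CoT sentence-level scoring with GPT-4 and
# (2) a regression model aggregating sentence scores into a report score.
from openai import OpenAI
from sklearn.linear_model import LinearRegression
import numpy as np

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One illustrative in-context example; the paper's 5-shot setting would
# include several radiologist-annotated sentence pairs in this role.
FEW_SHOT = (
    'Reference: "No focal consolidation is seen."\n'
    'Candidate: "There is no evidence of consolidation."\n'
    "Reasoning: Both sentences deny consolidation, so the meaning is preserved.\n"
    "Score: 5\n"
)

def score_sentence(reference: str, candidate: str) -> float:
    """Ask the LLM for a chain-of-thought rationale followed by a 0-5 score."""
    prompt = (
        "You are evaluating AI-generated radiology report sentences.\n"
        "Think step by step about factual agreement with the reference, "
        "then give a score from 0 (contradictory) to 5 (equivalent).\n\n"
        + FEW_SHOT
        + f'\nReference: "{reference}"\nCandidate: "{candidate}"\nReasoning:'
    )
    resp = client.chat.completions.create(
        model="gpt-4",  # model name is an assumption
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    text = resp.choices[0].message.content
    # Assumes the model ends its answer with a line like "Score: 4".
    return float(text.rsplit("Score:", 1)[-1].strip().split()[0])

def report_features(sentence_scores: list[float]) -> np.ndarray:
    """Collapse variable-length per-sentence scores into fixed-length features."""
    s = np.asarray(sentence_scores, dtype=float)
    return np.array([s.mean(), s.min(), s.max(), s.std()])

# Hypothetical training data: per-report features paired with
# radiologist-assigned report-level scores.
X = np.stack([report_features([5, 4, 3]),
              report_features([2, 1, 3]),
              report_features([5, 5, 4])])
y = np.array([4.0, 1.5, 4.5])
regressor = LinearRegression().fit(X, y)
print(regressor.predict(report_features([4, 4, 2]).reshape(1, -1)))
```

Fixing the temperature to 0 and parsing a trailing "Score:" line keeps the sentence-level judgments deterministic enough to feed a small regressor; the actual system would be trained on the radiologist annotations the abstract mentions rather than the toy arrays here.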
Related papers
- Fine-Tuning In-House Large Language Models to Infer Differential Diagnosis from Radiology Reports [1.5972172622800358]
This study introduces a pipeline for developing in-house LLMs tailored to identify differential diagnoses from radiology reports.
Evaluated on a set of 1,067 reports annotated by clinicians, the proposed model achieves an average F1 score of 92.1%, which is on par with GPT-4.
arXiv Detail & Related papers (2024-10-11T20:16:25Z) - RaTEScore: A Metric for Radiology Report Generation [59.37561810438641]
This paper introduces a novel entity-aware metric, Radiological Report (Text) Evaluation (RaTEScore).
RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions.
Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
arXiv Detail & Related papers (2024-06-24T17:49:28Z) - LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation [37.20505633019773]
Evaluating generated radiology reports is crucial for the development of radiology AI.
This study proposes a novel evaluation framework using large language models (LLMs) to compare radiology reports for assessment.
arXiv Detail & Related papers (2024-04-01T09:02:12Z) - Fine-tuning Large Language Models for Automated Diagnostic Screening Summaries [0.024105148723769353]
We evaluate several state-of-the-art Large Language Models (LLMs) for generating concise summaries from mental state examinations.
We rigorously evaluate four different models for summary generation using established ROUGE metrics and input from human evaluators.
Our top-performing fine-tuned model outperforms existing models, achieving ROUGE-1 and ROUGE-L values of 0.810 and 0.764, respectively.
arXiv Detail & Related papers (2024-03-29T12:25:37Z) - Reshaping Free-Text Radiology Notes Into Structured Reports With Generative Transformers [0.29530625605275984]
Structured reporting (SR) has been recommended by various medical societies.
We propose a pipeline to extract information from free-text reports.
Our work aims to leverage the potential of Natural Language Processing (NLP) and Transformer-based models.
arXiv Detail & Related papers (2024-03-27T18:38:39Z) - CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - Exploring the Boundaries of GPT-4 in Radiology [46.30976153809968]
GPT-4 has a sufficient level of radiology knowledge with only occasional errors in complex context.
For findings summarisation, GPT-4 outputs are found to be overall comparable with existing manually-written impressions.
arXiv Detail & Related papers (2023-10-23T05:13:03Z) - ChatRadio-Valuer: A Chat Large Language Model for Generalizable
Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations.
The clinical dataset utilized in this study encompasses a remarkable total of 332,673 observations.
ChatRadio-Valuer consistently outperforms state-of-the-art models, including ChatGPT (GPT-3.5-Turbo) and GPT-4.
arXiv Detail & Related papers (2023-10-08T17:23:17Z) - Radiology-Llama2: Best-in-Class Large Language Model for Radiology [71.27700230067168]
This paper introduces Radiology-Llama2, a large language model specialized for radiology through a process known as instruction tuning.
Quantitative evaluations using ROUGE metrics on the MIMIC-CXR and OpenI datasets demonstrate that Radiology-Llama2 achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-08-29T17:44:28Z) - Human Evaluation and Correlation with Automatic Metrics in Consultation
Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BertScore (see the sketch after this list).
arXiv Detail & Related papers (2022-04-01T14:04:16Z)
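Several of the related papers above lean on surface-level text metrics: the diagnostic-screening summaries are scored with ROUGE-1/ROUGE-L, and the consultation-note study finds a character-based Levenshtein distance competitive with BertScore. The following is a minimal sketch of how such metrics can be computed on a toy report pair; the example texts are invented and the rouge-score package is assumed to be installed, so this only illustrates the metric definitions rather than reproducing any paper's evaluation.

```python
# Surface-level metrics on a toy reference/candidate report pair.
from rouge_score import rouge_scorer

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance at the character level."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    """Normalize edit distance to a 0-1 similarity score."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

reference = "No acute cardiopulmonary abnormality. Heart size is normal."
candidate = "Heart size is normal. No acute cardiopulmonary process."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print(f"Levenshtein similarity: {levenshtein_similarity(reference, candidate):.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}, "
      f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```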
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.