Related papers: Semantic Similarity in Radiology Reports via LLMs and NER

Semantic Similarity in Radiology Reports via LLMs and NER

URL: http://arxiv.org/abs/2510.03102v2
Date: Mon, 06 Oct 2025 14:04:39 GMT
Title: Semantic Similarity in Radiology Reports via LLMs and NER
Authors: Beth Pearson, Ahmed Adnan, Zahraa S. Abdallah,
Abstract summary: Radiology report evaluation is a crucial part of radiologists' training and plays a key role in ensuring diagnostic accuracy.<n>Identifying semantic differences between preliminary and final reports is essential for junior doctors, both as a training tool and to help uncover gaps in clinical knowledge.<n>While AI in radiology is a rapidly growing field, the application of large language models (LLMs) remains challenging due to the need for specialised domain knowledge.<n>We propose Llama-EntScore, a semantic similarity scoring method using a combination of Llama 3.1 and NER with tunable weights to emphasise or de-emphasise specific types of differences.
Score: 1.2489632787815885
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Radiology report evaluation is a crucial part of radiologists' training and plays a key role in ensuring diagnostic accuracy. As part of the standard reporting workflow, a junior radiologist typically prepares a preliminary report, which is then reviewed and edited by a senior radiologist to produce the final report. Identifying semantic differences between preliminary and final reports is essential for junior doctors, both as a training tool and to help uncover gaps in clinical knowledge. While AI in radiology is a rapidly growing field, the application of large language models (LLMs) remains challenging due to the need for specialised domain knowledge. In this paper, we explore the ability of LLMs to provide explainable and accurate comparisons of reports in the radiology domain. We begin by comparing the performance of several LLMs in comparing radiology reports. We then assess a more traditional approach based on Named-Entity-Recognition (NER). However, both approaches exhibit limitations in delivering accurate feedback on semantic similarity. To address this, we propose Llama-EntScore, a semantic similarity scoring method using a combination of Llama 3.1 and NER with tunable weights to emphasise or de-emphasise specific types of differences. Our approach generates a quantitative similarity score for tracking progress and also gives an interpretation of the score that aims to offer valuable guidance in reviewing and refining their reporting. We find our method achieves 67% exact-match accuracy and 93% accuracy within +/- 1 when compared to radiologist-provided ground truth scores - outperforming both LLMs and NER used independently. Code is available at: https://github.com/otmive/llama_reports

Related papers

MRG-R1: Reinforcement Learning for Clinically Aligned Medical Report Generation [23.22547135801011]
We propose a semantic-driven reinforcement learning (SRL) method for medical report generation.<n>SRL encourages clinical-correctness-guided learning beyond imitation of language style.<n>We evaluate Medical Report Generation with SRL on two datasets: IU X-Ray and MIMIC-CXR.
arXiv Detail & Related papers (2025-12-18T03:57:55Z)
Ontology-Based Concept Distillation for Radiology Report Retrieval and Labeling [10.504309161945065]
Most existing methods rely on comparing high-dimensional text embeddings from models like CLIP or CXR-BERT.<n>We propose a novel, ontology-driven alternative for comparing radiology report texts based on clinically grounded concepts from the Unified Medical Language System.<n>Our method extracts standardised medical entities from free-text reports using an enhanced pipeline built on RadGraph-XL and SapBERT.
arXiv Detail & Related papers (2025-08-27T14:20:50Z)
GEMA-Score: Granular Explainable Multi-Agent Scoring Framework for Radiology Report Evaluation [7.838068874909676]
Granular Explainable Multi-Agent Score (GEMA-Score) conducts both objective and subjective evaluation through a large language model-based multi-agent workflow.<n>GEMA-Score achieves the highest correlation with human expert evaluations on a public dataset.
arXiv Detail & Related papers (2025-03-07T11:42:22Z)
RaTEScore: A Metric for Radiology Report Generation [59.37561810438641]
This paper introduces a novel, entity-aware metric, as Radiological Report (Text) Evaluation (RaTEScore) RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions. Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
arXiv Detail & Related papers (2024-06-24T17:49:28Z)
The current status of large language models in summarizing radiology report impressions [13.402769727597812]
The effectiveness of large language models (LLMs) in summarizing radiology report impressions remains unclear. Three types of radiology reports, i.e., CT, PET-CT, and Ultrasound reports, are collected from Peking University Cancer Hospital and Institute. We use the report findings to construct the zero-shot, one-shot, and three-shot prompts with complete example reports to generate the impressions.
arXiv Detail & Related papers (2024-06-04T09:23:30Z)
ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations. The clinical dataset utilized in this study encompasses a remarkable total of textbf332,673 observations. ChatRadio-Valuer consistently outperforms state-of-the-art models, especially ChatGPT (GPT-3.5-Turbo) and GPT-4 et al.
arXiv Detail & Related papers (2023-10-08T17:23:17Z)
Radiology-Llama2: Best-in-Class Large Language Model for Radiology [71.27700230067168]
This paper introduces Radiology-Llama2, a large language model specialized for radiology through a process known as instruction tuning. Quantitative evaluations using ROUGE metrics on the MIMIC-CXR and OpenI datasets demonstrate that Radiology-Llama2 achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-08-29T17:44:28Z)
An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians. Recent studies have achieved promising results in automatic impression generation using large-scale medical text data. These models often require substantial amounts of medical text data and have poor generalization performance.
arXiv Detail & Related papers (2023-04-17T17:13:42Z)
Knowledge Graph Construction and Its Application in Automatic Radiology Report Generation from Radiologist's Dictation [22.894248859405767]
This paper focuses on applications of NLP techniques like Information Extraction (IE) and domain-specific Knowledge Graph (KG) to automatically generate radiology reports from radiologist's dictation. We develop an information extraction pipeline that combines rule-based, pattern-based, and dictionary-based techniques with lexical-semantic features to extract entities and relations. We generate pathological descriptions evaluated using semantic similarity metrics, which shows 97% similarity with gold standard pathological descriptions.
arXiv Detail & Related papers (2022-06-13T16:46:54Z)
Exploring and Distilling Posterior and Prior Knowledge for Radiology Report Generation [55.00308939833555]
The PPKED includes three modules: Posterior Knowledge Explorer (PoKE), Prior Knowledge Explorer (PrKE) and Multi-domain Knowledge Distiller (MKD) PoKE explores the posterior knowledge, which provides explicit abnormal visual regions to alleviate visual data bias. PrKE explores the prior knowledge from the prior medical knowledge graph (medical knowledge) and prior radiology reports (working experience) to alleviate textual data bias.
arXiv Detail & Related papers (2021-06-13T11:10:02Z)
Auxiliary Signal-Guided Knowledge Encoder-Decoder for Medical Report Generation [107.3538598876467]
We propose an Auxiliary Signal-Guided Knowledge-Decoder (ASGK) to mimic radiologists' working patterns. ASGK integrates internal visual feature fusion and external medical linguistic information to guide medical knowledge transfer and learning.
arXiv Detail & Related papers (2020-06-06T01:00:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.