CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays
- URL: http://arxiv.org/abs/2505.18087v1
- Date: Fri, 23 May 2025 16:44:21 GMT
- Title: CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays
- Authors: Hyungyung Lee, Geon Choi, Jung-Oh Lee, Hangyul Yoon, Hyuk Gi Hong, Edward Choi
- Abstract summary: We present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps.
- Score: 9.051771615770075
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent progress in Large Vision-Language Models (LVLMs) has enabled promising applications in medical tasks, such as report generation and visual question answering. However, existing benchmarks focus mainly on the final diagnostic answer, offering limited insight into whether models engage in clinically meaningful reasoning. To address this, we present CheXStruct and CXReasonBench, a structured pipeline and benchmark built on the publicly available MIMIC-CXR-JPG dataset. CheXStruct automatically derives a sequence of intermediate reasoning steps directly from chest X-rays, such as segmenting anatomical regions, deriving anatomical landmarks and diagnostic measurements, computing diagnostic indices, and applying clinical thresholds. CXReasonBench leverages this pipeline to evaluate whether models can perform clinically valid reasoning steps and to what extent they can learn from structured guidance, enabling fine-grained and transparent assessment of diagnostic reasoning. The benchmark comprises 18,988 QA pairs across 12 diagnostic tasks and 1,200 cases, each paired with up to 4 visual inputs, and supports multi-path, multi-stage evaluation including visual grounding via anatomical region selection and diagnostic measurements. Even the strongest of 10 evaluated LVLMs struggle with structured reasoning and generalization, often failing to link abstract knowledge with anatomically grounded visual interpretation. The code is available at https://github.com/ttumyche/CXReasonBench
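For a concrete sense of the kind of intermediate step CheXStruct extracts, the sketch below computes a cardiothoracic ratio from binary heart and thorax segmentation masks and applies the widely used 0.5 clinical threshold for cardiomegaly. The toy masks, helper names, and thresholding code are illustrative assumptions, not the pipeline's actual implementation.

```python
import numpy as np

def horizontal_extent(mask: np.ndarray) -> float:
    """Widest horizontal span (in pixels) covered by a binary mask."""
    cols = np.where(mask.any(axis=0))[0]
    return float(cols.max() - cols.min() + 1) if cols.size else 0.0

def cardiothoracic_ratio(heart_mask: np.ndarray, thorax_mask: np.ndarray) -> float:
    """Diagnostic index: maximal cardiac width / maximal thoracic width."""
    thoracic_width = horizontal_extent(thorax_mask)
    if thoracic_width == 0:
        raise ValueError("empty thorax mask")
    return horizontal_extent(heart_mask) / thoracic_width

# Toy masks standing in for segmentation-model output (assumed 2D binary arrays).
heart = np.zeros((8, 16), dtype=bool); heart[3:6, 5:12] = True    # 7 px wide
thorax = np.zeros((8, 16), dtype=bool); thorax[1:8, 2:14] = True  # 12 px wide

ctr = cardiothoracic_ratio(heart, thorax)
# A commonly cited adult threshold is CTR > 0.5 for cardiomegaly.
print(f"CTR = {ctr:.2f} -> {'cardiomegaly' if ctr > 0.5 else 'normal'}")
```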
Related papers
- RadFabric: Agentic AI System with Reasoning Capability for Radiology [61.25593938175618]
RadFabric is a multi-agent, multimodal reasoning framework that unifies visual and textual analysis for comprehensive CXR interpretation. The system employs specialized CXR agents for pathology detection, an Anatomical Interpretation Agent to map visual findings to precise anatomical structures, and a Reasoning Agent powered by large multimodal reasoning models to synthesize visual, anatomical, and clinical data into transparent, evidence-based diagnoses.
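A minimal sketch of how such an agentic pipeline could be wired together follows; the `CXRCase` structure, agent interfaces, and stub outputs are assumptions inferred only from the abstract, not RadFabric's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class CXRCase:
    image_path: str
    findings: list = field(default_factory=list)   # pathology detections
    anatomy: dict = field(default_factory=dict)    # finding -> structure
    diagnosis: str = ""

def run_pipeline(case: CXRCase, detect: Callable, localize: Callable,
                 reason: Callable) -> CXRCase:
    """Hypothetical orchestration: detection -> anatomical mapping -> reasoning."""
    case.findings = detect(case.image_path)
    case.anatomy = {f: localize(case.image_path, f) for f in case.findings}
    case.diagnosis = reason(case.findings, case.anatomy)
    return case

# Stub agents standing in for the specialized models described in the abstract.
case = run_pipeline(
    CXRCase("study_001.jpg"),
    detect=lambda img: ["opacity"],
    localize=lambda img, f: "right lower lobe",
    reason=lambda fs, an: f"{fs[0]} in {an[fs[0]]}: possible pneumonia",
)
print(case.diagnosis)
```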
arXiv Detail & Related papers (2025-06-17T03:10:33Z)
- Encoding of Demographic and Anatomical Information in Chest X-Ray-based Severe Left Ventricular Hypertrophy Classifiers [36.052936348670634]
We introduce a direct classification framework that predicts severe left ventricular hypertrophy from chest X-rays. Our approach achieves high AUROC and AUPRC, and employs Mutual Information Neural Estimation to quantify feature expressivity.
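Mutual Information Neural Estimation (MINE) maximizes the Donsker-Varadhan lower bound on mutual information with a learned statistics network. The sketch below shows that bound; how the paper applies it to chest X-ray features is not specified in the summary, so the dimensions and network here are placeholders.

```python
import torch
import torch.nn as nn

class StatisticsNet(nn.Module):
    """T(x, z) network for the Donsker-Varadhan bound used by MINE."""
    def __init__(self, dim_x: int, dim_z: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_z, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1))

def mine_lower_bound(t_net, x, z):
    """I(X;Z) >= E[T(x,z)] - log E[exp(T(x,z'))], where z' comes from the
    marginal, obtained here by shuffling the batch."""
    joint = t_net(x, z).mean()
    z_shuffled = z[torch.randperm(z.size(0))]
    marginal = t_net(x, z_shuffled).exp().mean().log()
    return joint - marginal

# Toy batch: maximizing this bound w.r.t. t_net's parameters estimates I(X;Z).
x, z = torch.randn(256, 8), torch.randn(256, 4)
print(mine_lower_bound(StatisticsNet(8, 4), x, z).item())
```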
arXiv Detail & Related papers (2025-05-31T13:30:04Z)
- DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis? [1.1094764204428438]
We propose DrVD-Bench, the first benchmark for clinical visual reasoning. DrVD-Bench consists of three modules: Visual Evidence, Reasoning Trajectory Assessment, and Report Generation Evaluation. Our benchmark covers 20 task types, 17 diagnostic categories, and five imaging modalities: CT, MRI, ultrasound, radiography, and pathology.
arXiv Detail & Related papers (2025-05-30T03:33:25Z)
- Interpreting Chest X-rays Like a Radiologist: A Benchmark with Clinical Reasoning [18.15610003617933]
We present CXRTrek, a new multi-stage visual question answering (VQA) dataset for chest X-ray (CXR) interpretation. The dataset is designed to explicitly simulate the diagnostic reasoning process employed by radiologists in real-world clinical settings. We propose a new vision-language large model (VLLM), CXRTrekNet, specifically designed to incorporate the clinical reasoning flow into the framework.
arXiv Detail & Related papers (2025-05-29T06:30:40Z)
- MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports [49.00805568780791]
We introduce MedCaseReasoning, the first open-access dataset for evaluating Large Language Models (LLMs) on their ability to align with clinician-authored diagnostic reasoning. The dataset includes 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements. We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in their diagnoses and reasoning.
arXiv Detail & Related papers (2025-05-16T22:34:36Z)
- CheXLearner: Text-Guided Fine-Grained Representation Learning for Progression Detection [14.414457048968439]
We present CheXLearner, the first end-to-end framework that unifies anatomical region detection, structure alignment, and semantic guidance. Our proposed Med-Manifold Alignment Module (Med-MAM) leverages hyperbolic geometry to robustly align anatomical structures. Our model attains a 91.52% average AUC score in downstream disease classification, validating its superior feature representation.
arXiv Detail & Related papers (2025-05-11T08:51:38Z)
- Aligning Human Knowledge with Visual Concepts Towards Explainable Medical Image Classification [8.382606243533942]
We introduce a simple yet effective framework, Explicd, towards Explainable language-informed criteria-based diagnosis.
By leveraging a pretrained vision-language model, Explicd injects these criteria into the embedding space as knowledge anchors.
The final diagnostic outcome is determined based on the similarity scores between the encoded visual concepts and the textual criteria embeddings.
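A minimal sketch of this similarity-based decision rule, assuming CLIP-style embeddings and mean aggregation over each class's criteria (both assumptions; Explicd's exact scoring may differ):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def diagnose(image_embedding: np.ndarray, criteria_embeddings: dict) -> str:
    """Score each class by mean similarity of the image to its textual criteria."""
    scores = {
        cls: np.mean([cosine(image_embedding, e) for e in embs])
        for cls, embs in criteria_embeddings.items()
    }
    return max(scores, key=scores.get)

rng = np.random.default_rng(0)
img = rng.normal(size=128)  # stand-in for a vision-language image embedding
criteria = {                # stand-in "knowledge anchor" text embeddings per class
    "benign": [rng.normal(size=128) for _ in range(3)],
    "malignant": [rng.normal(size=128) for _ in range(3)],
}
print(diagnose(img, criteria))
```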
arXiv Detail & Related papers (2024-06-08T23:23:28Z)
- Prompt-Guided Generation of Structured Chest X-Ray Report Using a Pre-trained LLM [5.766695041882696]
We introduce a prompt-guided approach to generate structured chest X-ray reports using a pre-trained large language model (LLM).
First, we identify anatomical regions in chest X-rays to generate focused sentences that center on key visual elements.
We also convert the detected anatomy into textual prompts conveying anatomical comprehension to the LLM.
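A toy illustration of turning detected anatomy into an LLM prompt; the template and region names are hypothetical, not the paper's actual prompts:

```python
def build_prompt(detected_regions: dict) -> str:
    """Turn detected anatomy and findings into a textual prompt for a report LLM."""
    lines = [f"- {region}: {finding}" for region, finding in detected_regions.items()]
    return (
        "You are drafting a structured chest X-ray report.\n"
        "Findings by anatomical region:\n" + "\n".join(lines) +
        "\nWrite one focused sentence per region."
    )

# Hypothetical detector output feeding the prompt.
print(build_prompt({"right lung": "patchy opacity", "cardiac silhouette": "enlarged"}))
```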
arXiv Detail & Related papers (2024-04-17T09:45:43Z)
- Towards the Identifiability and Explainability for Personalized Learner Modeling: An Inductive Paradigm [36.60917255464867]
We propose an identifiable cognitive diagnosis framework (ID-CDF) based on a novel response-proficiency-response paradigm inspired by encoder-decoder models.
We show that ID-CDF can effectively address these problems without loss of diagnostic precision.
arXiv Detail & Related papers (2023-09-01T07:18:02Z)
- Xplainer: From X-Ray Observations to Explainable Zero-Shot Diagnosis [36.45569352490318]
We introduce Xplainer, a framework for explainable zero-shot diagnosis in the clinical setting.
Xplainer adapts the classification-by-description approach of contrastive vision-language models to the multi-label medical diagnosis task.
Our results suggest that Xplainer provides a more detailed understanding of the decision-making process.
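The classification-by-description idea can be sketched as scoring a set of textual descriptors against the image and aggregating; the descriptors, probabilities, and mean aggregation below are illustrative assumptions rather than Xplainer's exact rule:

```python
import numpy as np

def pathology_probability(descriptor_probs: dict) -> float:
    """Aggregate descriptor-level probabilities into a pathology score."""
    return float(np.mean(list(descriptor_probs.values())))

# Stand-in outputs of a contrastive VLM scoring each descriptor against the image.
pneumonia_descriptors = {
    "airspace consolidation": 0.81,
    "air bronchograms": 0.64,
    "increased opacity in a lobar distribution": 0.72,
}
p = pathology_probability(pneumonia_descriptors)
print(f"P(pneumonia) ~ {p:.2f}")  # the descriptors double as the explanation
```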
arXiv Detail & Related papers (2023-03-23T16:07:31Z)
- Improving Classification Model Performance on Chest X-Rays through Lung Segmentation [63.45024974079371]
We propose a deep learning approach to enhance abnormal chest X-ray (CXR) identification performance through lung segmentation.
Our approach is cascaded, incorporating two modules: a deep neural network with criss-cross attention (XLSor) that localizes the lung region in CXR images, and a CXR classification model whose backbone is a self-supervised momentum contrast (MoCo) model pre-trained on large-scale CXR datasets.
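A runnable sketch of such a cascade, with toy stand-ins for the XLSor segmenter and the MoCo-pretrained classifier (both assumptions for illustration):

```python
import numpy as np

def cascade_classify(image: np.ndarray, segment, classify) -> float:
    """Cascade: crop to the predicted lung region, then classify the crop."""
    mask = segment(image)                       # binary lung mask
    ys, xs = np.where(mask)
    crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    return classify(crop)                       # abnormality probability

# Toy stand-ins so the sketch runs end to end.
img = np.random.rand(64, 64)
seg = lambda im: np.pad(np.ones((40, 40), bool), 12)  # fake central lung mask
clf = lambda crop: float(crop.mean())                 # fake classifier score
print(f"abnormality score: {cascade_classify(img, seg, clf):.2f}")
```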
arXiv Detail & Related papers (2022-02-22T15:24:06Z)
- BI-RADS-Net: An Explainable Multitask Learning Approach for Cancer Diagnosis in Breast Ultrasound Images [69.41441138140895]
This paper introduces BI-RADS-Net, a novel explainable deep learning approach for cancer detection in breast ultrasound images.
The proposed approach incorporates tasks for explaining and classifying breast tumors, by learning feature representations relevant to clinical diagnosis.
Explanations of the predictions (benign or malignant) are provided in terms of morphological features that are used by clinicians for diagnosis and reporting in medical practice.
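One way to realize such multitask explainability is a shared encoder with separate diagnosis and morphology heads, as in the sketch below; the layer sizes and the descriptor count are assumptions, not BI-RADS-Net's architecture:

```python
import torch
import torch.nn as nn

class MultitaskBreastNet(nn.Module):
    """Shared encoder, a diagnosis head, and a morphology (explanation) head."""
    def __init__(self, n_morph: int = 5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.diagnosis = nn.Linear(16, 2)          # benign vs. malignant
        self.morphology = nn.Linear(16, n_morph)   # e.g., shape, margin, orientation

    def forward(self, x):
        h = self.encoder(x)
        return self.diagnosis(h), torch.sigmoid(self.morphology(h))

logits, morph = MultitaskBreastNet()(torch.randn(1, 1, 128, 128))
print(logits.shape, morph.shape)  # torch.Size([1, 2]) torch.Size([1, 5])
```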
arXiv Detail & Related papers (2021-10-05T19:14:46Z)
- Act Like a Radiologist: Towards Reliable Multi-view Correspondence Reasoning for Mammogram Mass Detection [49.14070210387509]
We propose an Anatomy-aware Graph convolutional Network (AGN) for mammogram mass detection.
AGN is tailored for mammogram mass detection and endows existing detection methods with multi-view reasoning ability.
Experiments on two standard benchmarks reveal that AGN significantly exceeds the state-of-the-art performance.
arXiv Detail & Related papers (2021-05-21T06:48:34Z)
- Weakly supervised multiple instance learning histopathological tumor segmentation [51.085268272912415]
We propose a weakly supervised framework for whole-slide image segmentation.
We exploit a multiple instance learning scheme for training models.
The proposed framework has been evaluated on multi-location and multi-centric public data from The Cancer Genome Atlas and the PatchCamelyon dataset.
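A classic multiple-instance formulation scores each patch and aggregates to a slide-level prediction; the max-pooling aggregator below is one standard choice and may differ from the paper's scheme:

```python
import torch
import torch.nn as nn

class MaxPoolMIL(nn.Module):
    """Score each patch, take the max as the slide (bag) prediction."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.instance_scorer = nn.Linear(feat_dim, 1)

    def forward(self, patch_features):            # (n_patches, feat_dim)
        scores = self.instance_scorer(patch_features).squeeze(-1)
        return scores.max(), scores               # bag score + per-patch map

bag_score, patch_scores = MaxPoolMIL()(torch.randn(500, 256))
# Training uses only the slide-level label on bag_score; patch_scores
# provide a tumor heatmap usable for weak segmentation.
print(bag_score.item())
```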
arXiv Detail & Related papers (2020-04-10T13:12:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.