CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography
- URL: http://arxiv.org/abs/2602.14879v2
- Date: Thu, 19 Feb 2026 18:19:25 GMT
- Title: CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography
- Authors: Qingqing Zhu, Qiao Jin, Tejas S. Mathai, Yin Fang, Zhizheng Wang, Yifan Yang, Maame Sarfo-Gyamfi, Benjamin Hou, Ran Gu, Praveen T. S. Balamuralikrishna, Kenneth C. Wang, Ronald M. Summers, Zhiyong Lu,
- Abstract summary: CT-Bench is a first-of-its-kind benchmark dataset for lesion analysis on computed tomography (CT)<n>It consists of a Lesion Image and Metadata Set containing 20,335 lesions from 7,795 CT studies with bounding boxes, descriptions, and size information, and a visual question answering benchmark with 2,850 QA pairs covering lesion localization, description, size estimation, and attribute categorization.<n>We evaluate state-of-the-art multimodal models, including vision-language and medical CLIP variants, by comparing their performance to radiologist assessments.
- Score: 16.912129516334783
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Artificial intelligence (AI) can automatically delineate lesions on computed tomography (CT) and generate radiology report content, yet progress is limited by the scarcity of publicly available CT datasets with lesion-level annotations. To bridge this gap, we introduce CT-Bench, a first-of-its-kind benchmark dataset comprising two components: a Lesion Image and Metadata Set containing 20,335 lesions from 7,795 CT studies with bounding boxes, descriptions, and size information, and a multitask visual question answering benchmark with 2,850 QA pairs covering lesion localization, description, size estimation, and attribute categorization. Hard negative examples are included to reflect real-world diagnostic challenges. We evaluate multiple state-of-the-art multimodal models, including vision-language and medical CLIP variants, by comparing their performance to radiologist assessments, demonstrating the value of CT-Bench as a comprehensive benchmark for lesion analysis. Moreover, fine-tuning models on the Lesion Image and Metadata Set yields significant performance gains across both components, underscoring the clinical utility of CT-Bench.
Related papers
- Structure Observation Driven Image-Text Contrastive Learning for Computed Tomography Report Generation [51.509572354327986]
This work introduces a novel two-stage (structure- and report-learning) framework tailored for Computed Tomography Report Generation (CTRG)<n>In the first stage, a set of learnable structure-specific visual queries observe corresponding structures in a CT image. The resulting observation tokens are contrasted with structure-specific textual features extracted from the accompanying radiology report with a structure-wise image-text contrastive loss.<n>In the second stage, the visual structure queries are frozen and used to select the critical image patch embeddings depicting each anatomical structure, minimizing distractions from irrelevant areas while reducing memory consumption.
arXiv Detail & Related papers (2026-03-05T07:07:07Z) - Dual-Encoder Transformer-Based Multimodal Learning for Ischemic Stroke Lesion Segmentation Using Diffusion MRI [5.332404648315838]
We study ischemic stroke lesion segmentation using multimodal diffusion MRI from the ISLES 2022 dataset.<n>Several state-of-the-art convolutional and transformer-based architectures, including U-Net variants, Swin-UNet, and TransUNet, are benchmarked.<n>Results show that transformer-based models outperform convolutional baselines, and the proposed dual-encoder TransUNet achieves the best performance, reaching a Dice score of 85.4% on the test set.
arXiv Detail & Related papers (2025-12-23T15:24:31Z) - A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis [82.01597026329158]
We introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS) for pathology-specific text-to-image synthesis.<n>CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy.<n>This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations.
arXiv Detail & Related papers (2025-12-15T10:22:43Z) - A Novel Multi-branch ConvNeXt Architecture for Identifying Subtle Pathological Features in CT Scans [1.2461503242570642]
This paper introduces a novel multi-branch ConvNeXt architecture designed specifically for the nuanced challenges of medical image analysis.<n>The proposed model incorporates a rigorous end-to-end pipeline, from meticulous data preprocessing to augmentation to a disciplined two-phase training strategy.<n> Experimental results demonstrate a superior performance on the validation set, achieving a final ROC-AUC of 0.9937, a validation accuracy of 0.9757, and an F1-score of 0.9825 for COVID-19 cases.
arXiv Detail & Related papers (2025-10-10T08:00:46Z) - A Clinically-Grounded Two-Stage Framework for Renal CT Report Generation [4.408787333571913]
We propose a framework for automatic renal CT report generation.<n>In Stage 1, a multi-task learning model detects structured clinical features from each 2D image.<n>In Stage 2, a vision-language model generates free-text reports conditioned on the image and the detected features.
arXiv Detail & Related papers (2025-06-30T07:45:02Z) - BRISC: Annotated Dataset for Brain Tumor Segmentation and Classification [0.6840587119863303]
We introduce BRISC, a dataset designed for brain tumor segmentation and classification tasks, featuring high-resolution segmentation masks.<n>The dataset comprises 6,000 contrast-enhanced T1-weighted MRI scans, which were collated from multiple public datasets that lacked segmentation labels.<n>It includes three major tumor types, namely glioma, meningioma, and pituitary, as well as non-tumorous cases.
arXiv Detail & Related papers (2025-06-17T08:56:05Z) - RadIR: A Scalable Framework for Multi-Grained Medical Image Retrieval via Radiology Report Mining [64.66825253356869]
We propose a novel methodology that leverages dense radiology reports to define image-wise similarity ordering at multiple granularities.<n>We construct two comprehensive medical imaging retrieval datasets: MIMIC-IR for Chest X-rays and CTRATE-IR for CT scans.<n>We develop two retrieval systems, RadIR-CXR and model-ChestCT, which demonstrate superior performance in traditional image-image and image-report retrieval tasks.
arXiv Detail & Related papers (2025-03-06T17:43:03Z) - MvKeTR: Chest CT Report Generation with Multi-View Perception and Knowledge Enhancement [1.6355783973385114]
Multi-view perception knowledge-enhanced TansfoRmer (MvKeTR)<n>MVPA with view-aware attention is proposed to synthesize diagnostic information from multiple anatomical views effectively.<n>Cross-Modal Knowledge Enhancer (CMKE) is devised to retrieve the most similar reports based on the query volume.
arXiv Detail & Related papers (2024-11-27T12:58:23Z) - RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis [56.57177181778517]
RadGenome-Chest CT is a large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE.
We leverage the latest powerful universal segmentation and large language models to extend the original datasets.
arXiv Detail & Related papers (2024-04-25T17:11:37Z) - Visual Grounding of Whole Radiology Reports for 3D CT Images [12.071135670684013]
We present the first visual grounding framework designed for CT image and report pairs covering various body parts and diverse anomaly types.
Our framework combines two components of 1) anatomical segmentation of images, and 2) report structuring.
We constructed a large-scale dataset with region-description correspondence annotations for 10,410 studies of 7,321 unique patients.
arXiv Detail & Related papers (2023-12-08T02:09:17Z) - HGT: A Hierarchical GCN-Based Transformer for Multimodal Periprosthetic
Joint Infection Diagnosis Using CT Images and Text [0.0]
Prosthetic Joint Infection (PJI) is a prevalent and severe complication.
Currently, a unified diagnostic standard incorporating both computed tomography (CT) images and numerical text data for PJI remains unestablished.
This study introduces a diagnostic method, HGT, based on deep learning and multimodal techniques.
arXiv Detail & Related papers (2023-05-29T11:25:57Z) - InDuDoNet+: A Model-Driven Interpretable Dual Domain Network for Metal
Artifact Reduction in CT Images [53.4351366246531]
We construct a novel interpretable dual domain network, termed InDuDoNet+, into which CT imaging process is finely embedded.
We analyze the CT values among different tissues, and merge the prior observations into a prior network for our InDuDoNet+, which significantly improve its generalization performance.
arXiv Detail & Related papers (2021-12-23T15:52:37Z) - Malignancy Prediction and Lesion Identification from Clinical
Dermatological Images [65.1629311281062]
We consider machine-learning-based malignancy prediction and lesion identification from clinical dermatological images.
We first identify all lesions present in the image regardless of sub-type or likelihood of malignancy, then it estimates their likelihood of malignancy, and through aggregation, it also generates an image-level likelihood of malignancy.
arXiv Detail & Related papers (2021-04-02T20:52:05Z) - Synergistic Learning of Lung Lobe Segmentation and Hierarchical
Multi-Instance Classification for Automated Severity Assessment of COVID-19
in CT Images [61.862364277007934]
We propose a synergistic learning framework for automated severity assessment of COVID-19 in 3D CT images.
A multi-task deep network (called M$2$UNet) is then developed to assess the severity of COVID-19 patients.
Our M$2$UNet consists of a patch-level encoder, a segmentation sub-network for lung lobe segmentation, and a classification sub-network for severity assessment.
arXiv Detail & Related papers (2020-05-08T03:16:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.