Related papers: A Clinically-Grounded Two-Stage Framework for Renal CT Report Generation

A Clinically-Grounded Two-Stage Framework for Renal CT Report Generation

URL: http://arxiv.org/abs/2506.23584v2
Date: Thu, 16 Oct 2025 06:21:00 GMT
Title: A Clinically-Grounded Two-Stage Framework for Renal CT Report Generation
Authors: Renjie Liang, Zhengkang Fan, Jinqian Pan, Chenkun Sun, Bruce Daniel Steinberg, Russell Terry, Jie Xu,
Abstract summary: We propose a framework for automatic renal CT report generation.<n>In Stage 1, a multi-task learning model detects structured clinical features from each 2D image.<n>In Stage 2, a vision-language model generates free-text reports conditioned on the image and the detected features.
Score: 4.408787333571913
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Objective Renal cancer is a common malignancy and a major cause of cancer-related deaths. Computed tomography (CT) is central to early detection, staging, and treatment planning. However, the growing CT workload increases radiologists' burden and risks incomplete documentation. Automatically generating accurate reports remains challenging because it requires integrating visual interpretation with clinical reasoning. Advances in artificial intelligence (AI), especially large language and vision-language models, offer potential to reduce workload and enhance diagnostic quality. Methods We propose a clinically informed, two-stage framework for automatic renal CT report generation. In Stage 1, a multi-task learning model detects structured clinical features from each 2D image. In Stage 2, a vision-language model generates free-text reports conditioned on the image and the detected features. To evaluate clinical fidelity, generated clinical features are extracted from the reports and compared with expert-annotated ground truth. Results Experiments on an expert-labeled dataset show that incorporating detected features improves both report quality and clinical accuracy. The model achieved an average AUC of 0.75 for key imaging features and a METEOR score of 0.33, demonstrating higher clinical consistency and fewer template-driven errors. Conclusion Linking structured feature detection with conditioned report generation provides a clinically grounded approach to integrate structured prediction and narrative drafting for renal CT reporting. This method enhances interpretability and clinical faithfulness, underscoring the value of domain-relevant evaluation metrics for medical AI development.

Related papers

AgentsEval: Clinically Faithful Evaluation of Medical Imaging Reports via Multi-Agent Reasoning [73.50200033931148]
We introduce AgentsEval, a multi-agent stream reasoning framework that emulates the collaborative diagnostic workflow of radiologists.<n>By dividing the evaluation process into interpretable steps including criteria definition, evidence extraction, alignment, and consistency scoring, AgentsEval provides explicit reasoning traces and structured clinical feedback.<n> Experimental results demonstrate that AgentsEval delivers clinically aligned, semantically faithful, and interpretable evaluations that remain robust under paraphrastic, semantic, and stylistic perturbations.
arXiv Detail & Related papers (2026-01-23T11:59:13Z)
A Semantically Enhanced Generative Foundation Model Improves Pathological Image Synthesis [82.01597026329158]
We introduce a Correlation-Regulated Alignment Framework for Tissue Synthesis (CRAFTS) for pathology-specific text-to-image synthesis.<n>CRAFTS incorporates a novel alignment mechanism that suppresses semantic drift to ensure biological accuracy.<n>This model generates diverse pathological images spanning 30 cancer types, with quality rigorously validated by objective metrics and pathologist evaluations.
arXiv Detail & Related papers (2025-12-15T10:22:43Z)
An Explainable Hybrid AI Framework for Enhanced Tuberculosis and Symptom Detection [55.35661671061754]
Tuberculosis remains a critical global health issue, particularly in resource-limited and remote areas.<n>We propose a framework which enhances disease and symptom detection on chest X-rays by integrating two supervised heads and a self-supervised head.<n>Our model achieves an accuracy of 98.85% for distinguishing between COVID-19, tuberculosis, and normal cases, and a macro-F1 score of 90.09% for multilabel symptom detection.
arXiv Detail & Related papers (2025-10-21T17:18:55Z)
Ocular-Induced Abnormal Head Posture: Diagnosis and Missing Data Imputation [1.7061463565692456]
Ocular-induced abnormal head posture (AHP) is a compensatory mechanism that arises from ocular misalignment.<n>This study addresses both challenges through two complementary deep learning frameworks.<n>AHP-CADNet is a multi-level attention fusion framework for automated diagnosis.<n> curriculum learning-based imputation framework is designed to mitigate missing data.
arXiv Detail & Related papers (2025-10-07T07:51:59Z)
EMeRALDS: Electronic Medical Record Driven Automated Lung Nodule Detection and Classification in Thoracic CT Images [4.533165461983661]
Lung cancer is a leading cause of cancer-related mortality worldwide.<n>This study aims to develop a computer-aided diagnosis (CAD) system that leverages large vision-language models (VLMs)<n>The proposed approach demonstrated strong performance in zero-shot lung nodule analysis.
arXiv Detail & Related papers (2025-09-15T09:11:17Z)
Teaching AI Stepwise Diagnostic Reasoning with Report-Guided Chain-of-Thought Learning [11.537036709742345]
DiagCoT is a framework that applies supervised fine-tuning to general-purpose vision-language models (VLMs)<n>DiagCoT combines contrastive image-report tuning for domain alignment, chain-of-thought supervision to capture inferential logic, and reinforcement tuning with clinical reward signals to enhance factual accuracy and fluency.<n>It outperformed state-of-the-art models including LLaVA-Med and CXR-LLAVA on long-tailed diseases and external datasets.
arXiv Detail & Related papers (2025-09-08T08:01:26Z)
A Disease-Centric Vision-Language Foundation Model for Precision Oncology in Kidney Cancer [54.58205672910646]
RenalCLIP is a visual-language foundation model for characterization, diagnosis and prognosis of renal mass.<n>It achieved better performance and superior generalizability across 10 core tasks spanning the full clinical workflow of kidney cancer.
arXiv Detail & Related papers (2025-08-22T17:48:19Z)
PriorRG: Prior-Guided Contrastive Pre-training and Coarse-to-Fine Decoding for Chest X-ray Report Generation [12.860257420677122]
PriorRG is a chest X-ray report generation framework that emulates real-world clinical via a two-stage training pipeline.<n>In Stage 1, we introduce a prior-guided contrast pre-training scheme that leverages clinical context guidetemporal feature extraction.<n>In Stage 2, we integrate a prior-aware coarsetemporal decoding for generation that enhances prior knowledge with the vision encoders hidden states.
arXiv Detail & Related papers (2025-08-07T13:02:20Z)
OrthoInsight: Rib Fracture Diagnosis and Report Generation Based on Multi-Modal Large Models [0.49478969093606673]
We propose OrthoInsight, a multi-modal deep learning framework for rib fracture diagnosis and report generation.<n>It integrates a YOLOv9 model for fracture detection, a medical knowledge graph for retrieving clinical context, and a fine-tuned LLaVA language model for generating diagnostic reports.<n> evaluated on 28,675 annotated CT images and expert reports, it achieves high performance across Diagnostic Accuracy, Content Completeness, Logical Coherence, and Clinical Guidance Value, with an average score of 4.28.
arXiv Detail & Related papers (2025-07-18T15:01:44Z)
Interactive Segmentation and Report Generation for CT Images [10.23242820828816]
We propose a novel interactive framework for 3D lesion morphology reporting.<n>We are the first to integrate the interactive segmentation and structured reports in 3D CT medical images.
arXiv Detail & Related papers (2025-03-05T09:18:27Z)
3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models [51.855377054763345]
This paper introduces 3D-CT-GPT, a Visual Question Answering (VQA)-based medical visual language model for generating radiology reports from 3D CT scans. Experiments on both public and private datasets demonstrate that 3D-CT-GPT significantly outperforms existing methods in terms of report accuracy and quality.
arXiv Detail & Related papers (2024-09-28T12:31:07Z)
Structural Entities Extraction and Patient Indications Incorporation for Chest X-ray Report Generation [10.46031380503486]
We introduce a novel method, textbfStructural textbfEntities extraction and patient indications textbfIncorporation (SEI) for chest X-ray report generation. We employ a structural entities extraction (SEE) approach to eliminate presentation-style vocabulary in reports. We propose a cross-modal fusion network to integrate information from X-ray images, similar historical cases, and patient-specific indications.
arXiv Detail & Related papers (2024-05-23T01:29:47Z)
Dia-LLaMA: Towards Large Language Model-driven CT Report Generation [4.634780391920529]
We propose Dia-LLaMA, a framework to adapt the LLaMA2-7B for CT report generation by incorporating diagnostic information as guidance prompts. Considering the high dimension of CT, we leverage a pre-trained ViT3D with perceiver to extract the visual information. To tailor the LLM for report generation and emphasize abnormality, we extract additional diagnostic information by referring to a disease prototype memory bank.
arXiv Detail & Related papers (2024-03-25T03:02:51Z)
Radiology Report Generation Using Transformers Conditioned with Non-imaging Data [55.17268696112258]
This paper proposes a novel multi-modal transformer network that integrates chest x-ray (CXR) images and associated patient demographic information. The proposed network uses a convolutional neural network to extract visual features from CXRs and a transformer-based encoder-decoder network that combines the visual features with semantic text embeddings of patient demographic information.
arXiv Detail & Related papers (2023-11-18T14:52:26Z)
Beyond Images: An Integrative Multi-modal Approach to Chest X-Ray Report Generation [47.250147322130545]
Image-to-text radiology report generation aims to automatically produce radiology reports that describe the findings in medical images. Most existing methods focus solely on the image data, disregarding the other patient information accessible to radiologists. We present a novel multi-modal deep neural network framework for generating chest X-rays reports by integrating structured patient data, such as vital signs and symptoms, alongside unstructured clinical notes.
arXiv Detail & Related papers (2023-11-18T14:37:53Z)
Medical Image Captioning via Generative Pretrained Transformers [57.308920993032274]
We combine two language models, the Show-Attend-Tell and the GPT-3, to generate comprehensive and descriptive radiology records. The proposed model is tested on two medical datasets, the Open-I, MIMIC-CXR, and the general-purpose MS-COCO.
arXiv Detail & Related papers (2022-09-28T10:27:10Z)
Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation [116.87918100031153]
We propose a Cross-modal clinical Graph Transformer (CGT) for ophthalmic report generation (ORG) CGT injects clinical relation triples into the visual features as prior knowledge to drive the decoding procedure. Experiments on the large-scale FFA-IR benchmark demonstrate that the proposed CGT is able to outperform previous benchmark methods.
arXiv Detail & Related papers (2022-06-04T13:16:30Z)
Factored Attention and Embedding for Unstructured-view Topic-related Ultrasound Report Generation [70.7778938191405]
We propose a novel factored attention and embedding model (termed FAE-Gen) for the unstructured-view topic-related ultrasound report generation. The proposed FAE-Gen mainly consists of two modules, i.e., view-guided factored attention and topic-oriented factored embedding, which capture the homogeneous and heterogeneous morphological characteristic across different views.
arXiv Detail & Related papers (2022-03-12T15:24:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.