Related papers: Radiology Report Conditional 3D CT Generation with Multi Encoder Latent diffusion Model

URL: http://arxiv.org/abs/2509.14780v1
Date: Thu, 18 Sep 2025 09:32:23 GMT
Title: Radiology Report Conditional 3D CT Generation with Multi Encoder Latent diffusion Model
Authors: Sina Amirrajab, Zohaib Salahuddin, Sheng Kuang, Henry C. Woodruff, Philippe Lambin
Abstract summary: Report2CT is a conditional diffusion framework for synthesizing 3D chest CT volumes directly from free text radiology reports. Report2CT generates anatomically consistent CT volumes with excellent visual quality and text image alignment.
Score: 0.830525411228399
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Text to image latent diffusion models have recently advanced medical image synthesis, but applications to 3D CT generation remain limited. Existing approaches rely on simplified prompts, neglecting the rich semantic detail in full radiology reports, which reduces text image alignment and clinical fidelity. We propose Report2CT, a radiology report conditional latent diffusion framework for synthesizing 3D chest CT volumes directly from free text radiology reports, incorporating both findings and impression sections using multiple text encoders. Report2CT integrates three pretrained medical text encoders (BiomedVLP CXR BERT, MedEmbed, and ClinicalBERT) to capture nuanced clinical context. Radiology reports and voxel spacing information condition a 3D latent diffusion model trained on 20000 CT volumes from the CT RATE dataset. Model performance was evaluated using Frechet Inception Distance (FID) for real synthetic distributional similarity and CLIP based metrics for semantic alignment, with additional qualitative and quantitative comparisons against the GenerateCT model. Report2CT generated anatomically consistent CT volumes with excellent visual quality and text image alignment. Multi encoder conditioning improved CLIP scores, indicating stronger preservation of fine grained clinical details in the free text radiology reports. Classifier free guidance further enhanced alignment with only a minor trade off in FID. We ranked first in the VLM3D Challenge at MICCAI 2025 on Text Conditional CT Generation and achieved state of the art performance across all evaluation metrics. By leveraging complete radiology reports and multi encoder text conditioning, Report2CT advances 3D CT synthesis, producing clinically faithful and high quality synthetic data.

Related papers

Structure Observation Driven Image-Text Contrastive Learning for Computed Tomography Report Generation [51.509572354327986]
This work introduces a novel two-stage (structure- and report-learning) framework tailored for Computed Tomography Report Generation (CTRG). In the first stage, a set of learnable structure-specific visual queries observe corresponding structures in a CT image. The resulting observation tokens are contrasted with structure-specific textual features extracted from the accompanying radiology report with a structure-wise image-text contrastive loss. In the second stage, the visual structure queries are frozen and used to select the critical image patch embeddings depicting each anatomical structure, minimizing distractions from irrelevant areas while reducing memory consumption.
arXiv Detail & Related papers (2026-03-05T07:07:07Z)
CTFlow: Video-Inspired Latent Flow Matching for 3D CT Synthesis [7.57931364659531]
We introduce CTFlow, a latent flow matching transformer model conditioned on clinical reports. We use the A-VAE from FLUX to define our latent space, and rely on the CT-Clip text encoder to encode the clinical reports. We evaluate our results against a state-of-the-art generative CT model, and demonstrate the superiority of our approach in terms of temporal coherence, image diversity and text-image alignment.
arXiv Detail & Related papers (2025-08-18T12:58:21Z)
A Clinically-Grounded Two-Stage Framework for Renal CT Report Generation [2.988064755409503]
We propose a two-stage framework for generating renal radiology reports from 2D CT slices. First, we extract structured abnormality features using a multi-task learning model trained to identify lesion attributes. These extracted features are combined with the corresponding CT image and fed into a fine-tuned vision-language model to generate natural language report sentences.
arXiv Detail & Related papers (2025-06-30T07:45:02Z)
Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining [0.8714814768600079]
We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. Our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text.
arXiv Detail & Related papers (2025-05-31T16:41:55Z)
3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models [51.855377054763345]
This paper introduces 3D-CT-GPT, a Visual Question Answering (VQA)-based medical visual language model for generating radiology reports from 3D CT scans. Experiments on both public and private datasets demonstrate that 3D-CT-GPT significantly outperforms existing methods in terms of report accuracy and quality.
arXiv Detail & Related papers (2024-09-28T12:31:07Z)
RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis [56.57177181778517]
RadGenome-Chest CT is a large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE. We leverage the latest powerful universal segmentation and large language models to extend the original datasets.
arXiv Detail & Related papers (2024-04-25T17:11:37Z)
Developing Generalist Foundation Models from a Multimodal Dataset for 3D Computed Tomography [21.25960416231541]
We introduce CT-RATE, a dataset that pairs 3D medical images with corresponding textual reports. We develop CT-CLIP, a CT-focused contrastive language-image pretraining framework. We create CT-CHAT, a vision-language chat model for 3D chest CT volumes.
arXiv Detail & Related papers (2024-03-26T16:19:56Z)
GuideGen: A Text-Guided Framework for Full-torso Anatomy and CT Volume Generation [1.138481191622247]
GuideGen is a controllable framework that generates anatomical masks and corresponding CT volumes for the entire torso, from chest to pelvis, based on free-form text prompts. Our approach includes three core components: a text-conditional semantic synthesizer for creating realistic full-torso anatomies; a contrast-aware autoencoder for detailed, high-fidelity feature extraction across varying contrast levels; and a latent feature generator that ensures alignment between CT images, anatomical semantics and input prompts.
arXiv Detail & Related papers (2024-03-12T02:09:39Z)
Radiology Report Generation Using Transformers Conditioned with Non-imaging Data [55.17268696112258]
This paper proposes a novel multi-modal transformer network that integrates chest x-ray (CXR) images and associated patient demographic information. The proposed network uses a convolutional neural network to extract visual features from CXRs and a transformer-based encoder-decoder network that combines the visual features with semantic text embeddings of patient demographic information.
arXiv Detail & Related papers (2023-11-18T14:52:26Z)
PathLDM: Text conditioned Latent Diffusion Model for Histopathology [62.970593674481414]
We introduce PathLDM, the first text-conditioned Latent Diffusion Model tailored for generating high-quality histopathology images. Our approach fuses image and textual data to enhance the generation process. We achieved a SoTA FID score of 7.64 for text-to-image generation on the TCGA-BRCA dataset, significantly outperforming the closest text-conditioned competitor with FID 30.1.
arXiv Detail & Related papers (2023-09-01T22:08:32Z)
Medical Image Captioning via Generative Pretrained Transformers [57.308920993032274]
We combine two language models, Show-Attend-Tell and GPT-3, to generate comprehensive and descriptive radiology records. The proposed model is tested on two medical datasets, Open-I and MIMIC-CXR, and on the general-purpose MS-COCO.
arXiv Detail & Related papers (2022-09-28T10:27:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.