Related papers: Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining

Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining

URL: http://arxiv.org/abs/2506.00633v1
Date: Sat, 31 May 2025 16:41:55 GMT
Title: Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining
Authors: Daniele Molino, Camillo Maria Caruso, Filippo Ruffini, Paolo Soda, Valerio Guarrasi,
Abstract summary: We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme.<n>Our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text.
Score: 0.8714814768600079
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Objective: While recent advances in text-conditioned generative models have enabled the synthesis of realistic medical images, progress has been largely confined to 2D modalities such as chest X-rays. Extending text-to-image generation to volumetric Computed Tomography (CT) remains a significant challenge, due to its high dimensionality, anatomical complexity, and the absence of robust frameworks that align vision-language data in 3D medical imaging. Methods: We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. Our approach leverages a dual-encoder CLIP-style model trained on paired CT volumes and radiology reports to establish a shared embedding space, which serves as the conditioning input for generation. CT volumes are compressed into a low-dimensional latent space via a pretrained volumetric VAE, enabling efficient 3D denoising diffusion without requiring external super-resolution stages. Results: We evaluate our method on the CT-RATE dataset and conduct a comprehensive assessment of image fidelity, clinical relevance, and semantic alignment. Our model achieves competitive performance across all tasks, significantly outperforming prior baselines for text-to-CT generation. Moreover, we demonstrate that CT scans synthesized by our framework can effectively augment real data, improving downstream diagnostic performance. Conclusion: Our results show that modality-specific vision-language alignment is a key component for high-quality 3D medical image generation. By integrating contrastive pretraining and volumetric diffusion, our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text, paving the way for new applications in data augmentation, medical education, and automated clinical simulation.

Related papers

Radiology Report Conditional 3D CT Generation with Multi Encoder Latent diffusion Model [0.830525411228399]
Report2CT is a conditional diffusion framework for synthesizing 3D chest CT volumes directly from free text radiology reports.<n>Report2CT generates anatomically consistent CT volumes with excellent visual quality and text image alignment.
arXiv Detail & Related papers (2025-09-18T09:32:23Z)
Accelerating 3D Photoacoustic Computed Tomography with End-to-End Physics-Aware Neural Operators [74.65171736966131]
Photoacoustic computed tomography (PACT) combines optical contrast with ultrasonic resolution, achieving deep-tissue imaging beyond the optical diffusion limit.<n>Current implementations require dense transducer arrays and prolonged acquisition times, limiting clinical translation.<n>We introduce Pano, an end-to-end physics-aware model that directly learns the inverse acoustic mapping from sensor measurements to volumetric reconstructions.
arXiv Detail & Related papers (2025-09-11T23:12:55Z)
CTFlow: Video-Inspired Latent Flow Matching for 3D CT Synthesis [7.57931364659531]
We introduce CTFlow, a latent flow matching transformer model conditioned on clinical reports.<n>We use the A-VAE from FLUX to define our latent space, and rely on the CT-Clip text encoder to encode the clinical reports.<n>We evaluate our results against state-of-the-art generative CT model, and demonstrate the superiority of our approach in terms of temporal coherence, image diversity and text-image alignment.
arXiv Detail & Related papers (2025-08-18T12:58:21Z)
TRACE: Temporally Reliable Anatomically-Conditioned 3D CT Generation with Enhanced Efficiency [40.82927972746919]
TRACE is a framework that generates 3D medical images with temporal alignment.<n>An overlapping-frame frame pairs pairs into a flexible length sequence, reconstructed into atemporally and anatomically aligned 3D volume.
arXiv Detail & Related papers (2025-07-01T14:35:39Z)
TextDiffSeg: Text-guided Latent Diffusion Model for 3d Medical Images Segmentation [0.0]
Text-guided diffusion model framework, TextDiffSeg, integrates 3D volumetric data with natural language descriptions.<n>By enhancing the model's ability to recognize complex anatomical structures, TextDiffSeg incorporates innovative label embedding techniques.<n> Experimental results demonstrate that TextDiffSeg consistently outperforms existing methods in segmentation tasks involving kidney and pancreas tumors.
arXiv Detail & Related papers (2025-04-16T07:17:36Z)
Abnormality-Driven Representation Learning for Radiology Imaging [0.8321462983924758]
We introduce lesion-enhanced contrastive learning (LeCL), a novel approach to obtain visual representations driven by abnormalities in 2D axial slices across different locations of the CT scans. We evaluate our approach across three clinical tasks: tumor lesion location, lung disease detection, and patient staging, benchmarking against four state-of-the-art foundation models.
arXiv Detail & Related papers (2024-11-25T13:53:26Z)
3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models [51.855377054763345]
This paper introduces 3D-CT-GPT, a Visual Question Answering (VQA)-based medical visual language model for generating radiology reports from 3D CT scans. Experiments on both public and private datasets demonstrate that 3D-CT-GPT significantly outperforms existing methods in terms of report accuracy and quality.
arXiv Detail & Related papers (2024-09-28T12:31:07Z)
Autoregressive Sequence Modeling for 3D Medical Image Representation [48.706230961589924]
We introduce a pioneering method for learning 3D medical image representations through an autoregressive sequence pre-training framework.<n>Our approach various 3D medical images based on spatial, contrast, and semantic correlations, treating them as interconnected visual tokens within a token sequence.
arXiv Detail & Related papers (2024-09-13T10:19:10Z)
CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning. Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates it can identify organs and abnormalities in a zero-shot manner using natural languages.
arXiv Detail & Related papers (2024-04-23T17:59:01Z)
Generative Enhancement for 3D Medical Images [74.17066529847546]
We propose GEM-3D, a novel generative approach to the synthesis of 3D medical images. Our method begins with a 2D slice, noted as the informed slice to serve the patient prior, and propagates the generation process using a 3D segmentation mask. By decomposing the 3D medical images into masks and patient prior information, GEM-3D offers a flexible yet effective solution for generating versatile 3D images.
arXiv Detail & Related papers (2024-03-19T15:57:04Z)
GuideGen: A Text-Guided Framework for Full-torso Anatomy and CT Volume Generation [1.138481191622247]
GuideGen is a controllable framework that generates anatomical masks and corresponding CT volumes for the entire torso-from chest to pelvis-based on free-form text prompts. Our approach includes three core components: a text-conditional semantic synthesizer for creating realistic full-torso anatomies; a contrast-aware autoencoder for detailed, high-fidelity feature extraction across varying contrast levels; and a latent feature generator that ensures alignment between CT images, anatomical semantics and input prompts.
arXiv Detail & Related papers (2024-03-12T02:09:39Z)
MedSyn: Text-guided Anatomy-aware Synthesis of High-Fidelity 3D CT Images [22.455833806331384]
This paper introduces an innovative methodology for producing high-quality 3D lung CT images guided by textual information. Current state-of-the-art approaches are limited to low-resolution outputs and underutilize radiology reports' abundant information.
arXiv Detail & Related papers (2023-10-05T14:16:22Z)
Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images. We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations. The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-31T17:59:42Z)
Medical Image Captioning via Generative Pretrained Transformers [57.308920993032274]
We combine two language models, the Show-Attend-Tell and the GPT-3, to generate comprehensive and descriptive radiology records. The proposed model is tested on two medical datasets, the Open-I, MIMIC-CXR, and the general-purpose MS-COCO.
arXiv Detail & Related papers (2022-09-28T10:27:10Z)
Incremental Cross-view Mutual Distillation for Self-supervised Medical CT Synthesis [88.39466012709205]
This paper builds a novel medical slice to increase the between-slice resolution. Considering that the ground-truth intermediate medical slices are always absent in clinical practice, we introduce the incremental cross-view mutual distillation strategy. Our method outperforms state-of-the-art algorithms by clear margins.
arXiv Detail & Related papers (2021-12-20T03:38:37Z)
Hierarchical Amortized Training for Memory-efficient High Resolution 3D GAN [52.851990439671475]
We propose a novel end-to-end GAN architecture that can generate high-resolution 3D images. We achieve this goal by using different configurations between training and inference. Experiments on 3D thorax CT and brain MRI demonstrate that our approach outperforms state of the art in image generation.
arXiv Detail & Related papers (2020-08-05T02:33:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.