Text2CT: Towards 3D CT Volume Generation from Free-text Descriptions Using Diffusion Model
- URL: http://arxiv.org/abs/2505.04522v1
- Date: Wed, 07 May 2025 15:53:56 GMT
- Title: Text2CT: Towards 3D CT Volume Generation from Free-text Descriptions Using Diffusion Model
- Authors: Pengfei Guo, Can Zhao, Dong Yang, Yufan He, Vishwesh Nath, Ziyue Xu, Pedro R. A. S. Bassi, Zongwei Zhou, Benjamin D. Simon, Stephanie Anne Harmon, Baris Turkbey, Daguang Xu,
- Abstract summary: We introduce Text2CT, a novel approach for synthesizing 3D CT volumes from textual descriptions.<n>The proposed framework encodes medical text into latent representations and decodes them into high-resolution 3D CT scans.
- Score: 17.778156861799207
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating 3D CT volumes from descriptive free-text inputs presents a transformative opportunity in diagnostics and research. In this paper, we introduce Text2CT, a novel approach for synthesizing 3D CT volumes from textual descriptions using the diffusion model. Unlike previous methods that rely on fixed-format text input, Text2CT employs a novel prompt formulation that enables generation from diverse, free-text descriptions. The proposed framework encodes medical text into latent representations and decodes them into high-resolution 3D CT scans, effectively bridging the gap between semantic text inputs and detailed volumetric representations in a unified 3D framework. Our method demonstrates superior performance in preserving anatomical fidelity and capturing intricate structures as described in the input text. Extensive evaluations show that our approach achieves state-of-the-art results, offering promising potential applications in diagnostics, and data augmentation.
Related papers
- Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining [0.8714814768600079]
We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme.<n>Our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text.
arXiv Detail & Related papers (2025-05-31T16:41:55Z) - TextDiffSeg: Text-guided Latent Diffusion Model for 3d Medical Images Segmentation [0.0]
Text-guided diffusion model framework, TextDiffSeg, integrates 3D volumetric data with natural language descriptions.<n>By enhancing the model's ability to recognize complex anatomical structures, TextDiffSeg incorporates innovative label embedding techniques.<n> Experimental results demonstrate that TextDiffSeg consistently outperforms existing methods in segmentation tasks involving kidney and pancreas tumors.
arXiv Detail & Related papers (2025-04-16T07:17:36Z) - Explaining 3D Computed Tomography Classifiers with Counterfactuals [5.782952470371709]
We extend the Latent Shift counterfactual generation method from 2D applications to explain 3D computed tomography (CT) scans.<n>We implement a slice-based autoencoder and gradient blocking except for specific chunks of slices.<n>Our approach is both memory-efficient and effective for generating interpretable counterfactuals in high-resolution 3D medical imaging.
arXiv Detail & Related papers (2025-02-11T00:44:20Z) - Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models [57.37244894146089]
We propose Diff2Scene, which leverages frozen representations from text-image generative models, along with salient-aware and geometric-aware masks, for open-vocabulary 3D semantic segmentation and visual grounding tasks.
We show that it outperforms competitive baselines and achieves significant improvements over state-of-the-art methods.
arXiv Detail & Related papers (2024-07-18T16:20:56Z) - Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model [65.58911408026748]
We propose Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts.
We first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline.
We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation.
arXiv Detail & Related papers (2024-04-28T04:05:10Z) - GuideGen: A Text-Guided Framework for Full-torso Anatomy and CT Volume Generation [1.138481191622247]
GuideGen is a controllable framework that generates anatomical masks and corresponding CT volumes for the entire torso-from chest to pelvis-based on free-form text prompts.
Our approach includes three core components: a text-conditional semantic synthesizer for creating realistic full-torso anatomies; a contrast-aware autoencoder for detailed, high-fidelity feature extraction across varying contrast levels; and a latent feature generator that ensures alignment between CT images, anatomical semantics and input prompts.
arXiv Detail & Related papers (2024-03-12T02:09:39Z) - Seek for Incantations: Towards Accurate Text-to-Image Diffusion
Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method can effectively learn the prompts to improve the matches between the input text and the generated images.
arXiv Detail & Related papers (2024-01-12T03:46:29Z) - VolumeDiffusion: Flexible Text-to-3D Generation with Efficient Volumetric Encoder [56.59814904526965]
This paper introduces a pioneering 3D encoder designed for text-to-3D generation.
A lightweight network is developed to efficiently acquire feature volumes from multi-view images.
The 3D volumes are then trained on a diffusion model for text-to-3D generation using a 3D U-Net.
arXiv Detail & Related papers (2023-12-18T18:59:05Z) - MedSyn: Text-guided Anatomy-aware Synthesis of High-Fidelity 3D CT Images [22.455833806331384]
This paper introduces an innovative methodology for producing high-quality 3D lung CT images guided by textual information.
Current state-of-the-art approaches are limited to low-resolution outputs and underutilize radiology reports' abundant information.
arXiv Detail & Related papers (2023-10-05T14:16:22Z) - IT3D: Improved Text-to-3D Generation with Explicit View Synthesis [71.68595192524843]
This study presents a novel strategy that leverages explicitly synthesized multi-view images to address these issues.
Our approach involves the utilization of image-to-image pipelines, empowered by LDMs, to generate posed high-quality images.
For the incorporated discriminator, the synthesized multi-view images are considered real data, while the renderings of the optimized 3D models function as fake data.
arXiv Detail & Related papers (2023-08-22T14:39:17Z) - TAPS3D: Text-Guided 3D Textured Shape Generation from Pseudo Supervision [114.56048848216254]
We present a novel framework, TAPS3D, to train a text-guided 3D shape generator with pseudo captions.
Based on rendered 2D images, we retrieve relevant words from the CLIP vocabulary and construct pseudo captions using templates.
Our constructed captions provide high-level semantic supervision for generated 3D shapes.
arXiv Detail & Related papers (2023-03-23T13:53:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.