CTFlow: Video-Inspired Latent Flow Matching for 3D CT Synthesis
- URL: http://arxiv.org/abs/2508.12900v1
- Date: Mon, 18 Aug 2025 12:58:21 GMT
- Title: CTFlow: Video-Inspired Latent Flow Matching for 3D CT Synthesis
- Authors: Jiayi Wang, Hadrien Reynaud, Franciskus Xaverius Erick, Bernhard Kainz,
- Abstract summary: We introduce CTFlow, a latent flow matching transformer model conditioned on clinical reports. We use the A-VAE from FLUX to define our latent space, and rely on the CT-Clip text encoder to encode the clinical reports. We evaluate our results against a state-of-the-art generative CT model, and demonstrate the superiority of our approach in terms of temporal coherence, image diversity and text-image alignment.
- Score: 7.57931364659531
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative modelling of entire CT volumes conditioned on clinical reports has the potential to accelerate research through data augmentation, privacy-preserving synthesis, and reduced regulatory constraints on patient data, while preserving diagnostic signals. With the recent release of CT-RATE, a large-scale collection of 3D CT volumes paired with their respective clinical reports, training large text-conditioned CT volume generation models has become achievable. In this work, we introduce CTFlow, a 0.5B latent flow matching transformer model conditioned on clinical reports. We leverage the A-VAE from FLUX to define our latent space, and rely on the CT-Clip text encoder to encode the clinical reports. To generate consistent whole CT volumes while keeping the memory constraints tractable, we rely on a custom autoregressive approach, where the model predicts the first sequence of slices of the volume from text only, and then relies on the previously generated sequence of slices and the text to predict the following sequence. We evaluate our results against a state-of-the-art generative CT model, and demonstrate the superiority of our approach in terms of temporal coherence, image diversity and text-image alignment, as measured by FID, FVD, IS, and CLIP scores.
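The abstract's two key mechanisms, a flow matching objective in latent space and autoregressive chunk-by-chunk slice generation, can be sketched as follows. This is a minimal illustration only: the single linear `velocity` map stands in for the paper's 0.5B transformer, the dimensions are made up, and the real model operates on A-VAE latents conditioned on CT-Clip text embeddings, none of which is reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, TEXT_DIM = 16, 16  # toy sizes, not the paper's

# Hypothetical stand-in for the flow matching transformer: one linear map from
# [noisy latent, previous-chunk latent, text embedding, t] to a velocity field.
W = rng.normal(scale=0.1, size=(LATENT_DIM * 2 + TEXT_DIM + 1, LATENT_DIM))

def velocity(x_t, t, text_emb, prev_latents):
    inp = np.concatenate([x_t, prev_latents, text_emb, t], axis=-1)
    return inp @ W

def flow_matching_loss(x1, text_emb, prev_latents):
    """Rectified-flow-style objective: regress the straight-line velocity x1 - x0
    along the linear interpolant between noise x0 and data latent x1."""
    x0 = rng.normal(size=x1.shape)           # noise endpoint (t = 0)
    t = rng.uniform(size=(x1.shape[0], 1))   # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1              # linear interpolant
    v_target = x1 - x0                       # constant target velocity
    v_pred = velocity(x_t, t, text_emb, prev_latents)
    return float(((v_pred - v_target) ** 2).mean())

def sample_chunk(text_emb, prev_latents, steps=8):
    """Euler-integrate the learned ODE from noise (t = 0) toward data (t = 1)."""
    x = rng.normal(size=(text_emb.shape[0], LATENT_DIM))
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((x.shape[0], 1), i * dt)
        x = x + dt * velocity(x, t, text_emb, prev_latents)
    return x

def generate_volume(text_emb, n_chunks=3):
    """Autoregressive rollout as described in the abstract: the first chunk is
    conditioned on text only (previous latents zeroed), and each later chunk
    also conditions on the previously generated chunk's latents."""
    prev = np.zeros((text_emb.shape[0], LATENT_DIM))
    chunks = []
    for _ in range(n_chunks):
        prev = sample_chunk(text_emb, prev)
        chunks.append(prev)
    return np.stack(chunks, axis=1)  # (batch, n_chunks, LATENT_DIM)
```

The autoregressive loop is what keeps memory tractable: only one chunk of slice latents is denoised at a time, with the previous chunk acting as the conditioning bridge that preserves temporal coherence across the whole volume.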
Related papers
- CT-Flow: Orchestrating CT Interpretation Workflow with Model Context Protocol Servers [14.499713300688555]
Most existing approaches for 3D CT analysis largely rely on static, single-pass inference. By leveraging the Model Context Protocol (MCP), CT-Flow shifts from closed-box inference to an open, tool-aware paradigm. Built upon this, CT-Flow functions as a clinical orchestrator capable of decomposing complex natural language queries into automated tool-use sequences.
arXiv Detail & Related papers (2026-02-23T07:19:30Z) - BridgeSplat: Bidirectionally Coupled CT and Non-Rigid Gaussian Splatting for Deformable Intraoperative Surgical Navigation [69.14180476971602]
We introduce BridgeSplat, a novel approach for deformable surgical navigation. Our method rigs 3D Gaussians to a CT mesh, enabling joint optimization of Gaussian parameters and mesh deformation. We demonstrate BridgeSplat's effectiveness on visceral pig surgeries and synthetic data of a human liver under simulation.
arXiv Detail & Related papers (2025-09-23T01:09:36Z) - Radiology Report Conditional 3D CT Generation with Multi Encoder Latent diffusion Model [0.830525411228399]
Report2CT is a conditional diffusion framework for synthesizing 3D chest CT volumes directly from free-text radiology reports. Report2CT generates anatomically consistent CT volumes with excellent visual quality and text-image alignment.
arXiv Detail & Related papers (2025-09-18T09:32:23Z) - Recurrent Visual Feature Extraction and Stereo Attentions for CT Report Generation [18.113659670915474]
We propose a large language model (LLM) based CTRG method with recurrent visual feature extraction and stereo attentions for hierarchical feature modeling. Specifically, we use a vision Transformer to recurrently process each slice in a CT volume, and employ a set of attentions over the encoded slices from different perspectives to obtain important visual information. Experimental results and further analysis on the benchmark M3D-Cap dataset show that our method outperforms strong baseline models.
arXiv Detail & Related papers (2025-06-24T14:29:06Z) - Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining [0.8714814768600079]
We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. Our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text.
arXiv Detail & Related papers (2025-05-31T16:41:55Z) - 3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models [51.855377054763345]
This paper introduces 3D-CT-GPT, a Visual Question Answering (VQA)-based medical visual language model for generating radiology reports from 3D CT scans.
Experiments on both public and private datasets demonstrate that 3D-CT-GPT significantly outperforms existing methods in terms of report accuracy and quality.
arXiv Detail & Related papers (2024-09-28T12:31:07Z) - RadGenome-Chest CT: A Grounded Vision-Language Dataset for Chest CT Analysis [56.57177181778517]
RadGenome-Chest CT is a large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE.
We leverage the latest powerful universal segmentation and large language models to extend the original datasets.
arXiv Detail & Related papers (2024-04-25T17:11:37Z) - CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates it can identify organs and abnormalities in a zero-shot manner using natural languages.
arXiv Detail & Related papers (2024-04-23T17:59:01Z) - GuideGen: A Text-Guided Framework for Full-torso Anatomy and CT Volume Generation [1.138481191622247]
GuideGen is a controllable framework that generates anatomical masks and corresponding CT volumes for the entire torso-from chest to pelvis-based on free-form text prompts.
Our approach includes three core components: a text-conditional semantic synthesizer for creating realistic full-torso anatomies; a contrast-aware autoencoder for detailed, high-fidelity feature extraction across varying contrast levels; and a latent feature generator that ensures alignment between CT images, anatomical semantics and input prompts.
arXiv Detail & Related papers (2024-03-12T02:09:39Z) - GenerateCT: Text-Conditional Generation of 3D Chest CT Volumes [2.410738584733268]
GenerateCT is the first approach to generating 3D medical imaging conditioned on free-form medical text prompts.
We benchmarked GenerateCT against cutting-edge methods, demonstrating its superiority across all key metrics.
GenerateCT enables the scaling of synthetic training datasets to arbitrary sizes.
arXiv Detail & Related papers (2023-05-25T13:16:39Z) - Incremental Cross-view Mutual Distillation for Self-supervised Medical CT Synthesis [88.39466012709205]
This paper builds a novel medical slice synthesis method to increase the between-slice resolution.
Considering that the ground-truth intermediate medical slices are always absent in clinical practice, we introduce the incremental cross-view mutual distillation strategy.
Our method outperforms state-of-the-art algorithms by clear margins.
arXiv Detail & Related papers (2021-12-20T03:38:37Z) - Efficient Learning and Decoding of the Continuous-Time Hidden Markov Model for Disease Progression Modeling [119.50438407358862]
We present the first complete characterization of efficient EM-based learning methods for CT-HMM models.
We show that EM-based learning consists of two challenges: the estimation of posterior state probabilities and the computation of end-state conditioned statistics.
We demonstrate the use of CT-HMMs with more than 100 states to visualize and predict disease progression using a glaucoma dataset and an Alzheimer's disease dataset.
arXiv Detail & Related papers (2021-10-26T20:06:05Z) - CyTran: A Cycle-Consistent Transformer with Multi-Level Consistency for Non-Contrast to Contrast CT Translation [56.622832383316215]
We propose a novel approach to translate unpaired contrast computed tomography (CT) scans to non-contrast CT scans.
Our approach is based on cycle-consistent generative adversarial convolutional transformers, for short, CyTran.
Our empirical results show that CyTran outperforms all competing methods.
arXiv Detail & Related papers (2021-10-12T23:25:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.