Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging
- URL: http://arxiv.org/abs/2510.20639v1
- Date: Thu, 23 Oct 2025 15:13:13 GMT
- Title: Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging
- Authors: Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Hadrien Reynaud, Dong Yang, Pengfei Guo, Marc Edgar, Daguang Xu, Bernhard Kainz, Bjoern Menze
- Abstract summary: We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement. It improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation, and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet, current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement, during which the model learns from short slice excerpts yet generalizes to scans exceeding 300 slices without additional memory overhead. BTB3D sets a new state-of-the-art on two key tasks: it improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation; and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically consistent 512×512×241 volumes. These results confirm that precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging. The codebase is available at: https://github.com/ibrahimethemhamamci/BTB3D
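A key mechanism behind unifying 2D and 3D processing is making the encoder causal along the slice axis: all depth padding goes toward earlier slices, so a single-slice input behaves exactly like a 2D convolution while longer volumes reuse the same weights. Below is a minimal, hypothetical PyTorch sketch of such a depth-causal layer (kernel sizes, channel widths, and names are illustrative assumptions, not the authors' code):

```python
# A minimal sketch of a depth-causal 3D convolution, in the spirit of BTB3D's
# causal encoder (hypothetical re-implementation for illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that is causal along the slice (depth) axis.

    Padding is applied only toward "earlier" slices, so features for slice t
    never depend on slices after t. With a single input slice (D=1) this
    reduces to a plain 2D convolution, letting one model handle both 2D and
    3D inputs with shared weights.
    """

    def __init__(self, in_ch, out_ch, kernel=(3, 3, 3)):
        super().__init__()
        kd, kh, kw = kernel
        self.causal_pad = kd - 1              # all depth padding goes to the past
        self.conv = nn.Conv3d(in_ch, out_ch, kernel,
                              padding=(0, kh // 2, kw // 2))

    def forward(self, x):                      # x: (B, C, D, H, W)
        x = F.pad(x, (0, 0, 0, 0, self.causal_pad, 0))  # pad depth on the left only
        return self.conv(x)

enc = CausalConv3d(1, 32)
slices_2d = torch.randn(1, 1, 1, 64, 64)       # single slice: behaves like 2D
volume_3d = torch.randn(1, 1, 16, 64, 64)      # short excerpt of a CT volume
print(enc(slices_2d).shape, enc(volume_3d).shape)
```

Because the padding never looks ahead, features for the first t slices are identical whether the model sees 16 slices or 300, which is what lets training on short excerpts generalize to full-length scans.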
Related papers
- Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models [5.020980730631682]
Existing 3D medical convolution- and transformer-based self-supervised learning (SSL) methods often lack deep semantic comprehension.
Recent advances in multimodal large language models (MLLMs) offer a promising way to enhance image understanding through text descriptions.
We propose Med3DInsight, a novel pretraining framework that integrates 3D image encoders with 2D MLLMs via a specially designed plane-slice-aware transformer module.
arXiv Detail & Related papers (2025-09-11T00:12:59Z)
- Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining [1.447808799346751]
We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme.
Our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text.
arXiv Detail & Related papers (2025-05-31T16:41:55Z)
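The contrastive vision-language pretraining this entry builds on is typically a symmetric InfoNCE objective over paired volume and report embeddings. A generic sketch follows (standard CLIP-style loss, not the paper's released code; the temperature value is an assumption):

```python
# Generic symmetric InfoNCE over paired (CT volume, report) embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(volume_emb, text_emb, temperature=0.07):
    v = F.normalize(volume_emb, dim=-1)          # (B, D) 3D encoder outputs
    t = F.normalize(text_emb, dim=-1)            # (B, D) report encoder outputs
    logits = v @ t.T / temperature               # (B, B) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```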
- Unifying 2D and 3D Vision-Language Understanding [85.84054120018625]
We introduce UniVLG, a unified architecture for 2D and 3D vision-language learning.
UniVLG bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems.
arXiv Detail & Related papers (2025-03-13T17:56:22Z)
- Explaining 3D Computed Tomography Classifiers with Counterfactuals [5.782952470371709]
We extend the Latent Shift counterfactual generation method from 2D applications to explain 3D computed tomography (CT) scans.
We implement a slice-based autoencoder and block gradients everywhere except for specific chunks of slices.
Our approach is both memory-efficient and effective for generating interpretable counterfactuals in high-resolution 3D medical imaging.
arXiv Detail & Related papers (2025-02-11T00:44:20Z)
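At its core, Latent Shift perturbs an autoencoder latent along the classifier's gradient and decodes the result as a counterfactual. A simplified sketch of just that step (the paper's slice-based encoding and selective gradient blocking are omitted; the callables and the shift magnitude `lam` are illustrative assumptions):

```python
# Core Latent Shift step: shift the latent along d(prediction)/d(latent)
# and decode a counterfactual volume.
import torch

def latent_shift(encoder, decoder, classifier, volume, lam=100.0):
    z = encoder(volume).detach().requires_grad_(True)
    score = classifier(decoder(z)).sum()        # prediction to explain
    (grad,) = torch.autograd.grad(score, z)     # d(prediction) / d(latent)
    with torch.no_grad():
        # Sweep lam over positive/negative values to exaggerate or remove
        # the predicted finding in the decoded counterfactual.
        return decoder(z + lam * grad)
```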
- DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions [6.464464511743737]
We introduce DCFormer, an efficient 3D image encoder that factorizes 3D convolutions into three parallel 1D convolutions along the depth, height, and width dimensions.
In zero-shot and fine-tuned detection of 18 pathologies, DCFormer consistently outperforms state-of-the-art 3D vision encoders.
arXiv Detail & Related papers (2025-02-07T17:10:22Z)
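The decomposition described above replaces a k×k×k kernel with three parallel 1D kernels, cutting the weights per channel pair from k³ to 3k. A minimal sketch (how DCFormer fuses the three paths, and the channel sizing, are assumptions on my part):

```python
# Sketch of a decomposed 3D convolution: three parallel 1D paths along
# depth, height, and width, summed.
import torch
import torch.nn as nn

class DecomposedConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        p = k // 2
        self.depth  = nn.Conv3d(in_ch, out_ch, (k, 1, 1), padding=(p, 0, 0))
        self.height = nn.Conv3d(in_ch, out_ch, (1, k, 1), padding=(0, p, 0))
        self.width  = nn.Conv3d(in_ch, out_ch, (1, 1, k), padding=(0, 0, p))

    def forward(self, x):          # x: (B, C, D, H, W)
        # Summing parallel paths costs 3k weights per channel pair
        # instead of k**3 for a dense 3D kernel.
        return self.depth(x) + self.height(x) + self.width(x)
```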
- CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, can identify organs and abnormalities in a zero-shot manner using natural language.
arXiv Detail & Related papers (2024-04-23T17:59:01Z)
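Zero-shot recognition after organ-level contrastive pretraining usually amounts to scoring a cropped region against natural-language prompts in the shared embedding space. A generic inference sketch (encoder interfaces, prompt wording, and the temperature are assumptions, not CT-GLIP's actual API):

```python
# Generic zero-shot scoring of an organ crop against text prompts.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(vision_encoder, text_encoder, organ_crop, prompts):
    v = F.normalize(vision_encoder(organ_crop), dim=-1)   # (1, D) crop embedding
    t = F.normalize(text_encoder(prompts), dim=-1)        # (N, D) prompt embeddings
    probs = (v @ t.T / 0.07).softmax(dim=-1)              # similarity over prompts
    return dict(zip(prompts, probs.squeeze(0).tolist()))

# e.g. zero_shot_classify(venc, tenc, crop,
#          ["a CT scan of a normal liver",
#           "a CT scan of a liver with a lesion"])
```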
- Generative Enhancement for 3D Medical Images [74.17066529847546]
We propose GEM-3D, a novel generative approach to the synthesis of 3D medical images.
Our method begins with a 2D slice, termed the informed slice, which serves as the patient prior, and propagates the generation process using a 3D segmentation mask.
By decomposing the 3D medical images into masks and patient prior information, GEM-3D offers a flexible yet effective solution for generating versatile 3D images.
arXiv Detail & Related papers (2024-03-19T15:57:04Z)
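One plausible reading of the mask-plus-prior decomposition is a conditioning tensor that stacks the 3D segmentation mask with the informed slice broadcast along depth; the generator (a diffusion model in the paper) is then conditioned on that tensor. A hedged sketch of just the conditioning step (the tensor layout and broadcast strategy are my assumptions):

```python
# Stack the 3D mask with the informed 2D slice broadcast along depth.
import torch

def build_condition(informed_slice, seg_mask):
    # informed_slice: (B, 1, H, W) patient prior; seg_mask: (B, 1, D, H, W)
    depth = seg_mask.shape[2]
    prior = informed_slice.unsqueeze(2).expand(-1, -1, depth, -1, -1)
    return torch.cat([seg_mask, prior], dim=1)   # (B, 2, D, H, W) conditioning input
```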
- Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models [1.64647940449869]
Existing 3D convolution- and transformer-based methods have limited semantic understanding of an image volume.
We propose Med3DInsight, which marries existing 3D image encoders with 2D MLLMs and bridges them via a Plane-Slice-Aware Transformer (PSAT) module.
arXiv Detail & Related papers (2024-03-08T08:15:53Z)
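The abstract gives only the module's name, but a bridge of this kind is commonly a cross-attention layer in which 3D encoder tokens attend to per-slice features from the frozen 2D MLLM. The sketch below is a loose, hypothetical reading of such a module (all dimensions and layer choices are assumptions):

```python
# Hypothetical plane-slice-aware bridge: 3D tokens cross-attend to
# per-slice 2D-MLLM features.
import torch
import torch.nn as nn

class PlaneSliceBridge(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vol_tokens, slice_tokens):
        # vol_tokens:   (B, N, dim) tokens from the 3D image encoder
        # slice_tokens: (B, S, dim) pooled 2D-MLLM features, one per slice/plane
        attended, _ = self.attn(vol_tokens, slice_tokens, slice_tokens)
        return self.norm(vol_tokens + attended)  # 3D tokens enriched with 2D semantics
```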
- Video Pretraining Advances 3D Deep Learning on Chest CT Tasks [63.879848037679224]
Pretraining on large natural image classification datasets has aided model development on data-scarce 2D medical tasks.
These 2D models have been surpassed by 3D models on 3D computer vision benchmarks.
We show video pretraining for 3D models can enable higher performance on smaller datasets for 3D medical tasks.
arXiv Detail & Related papers (2023-04-02T14:46:58Z)
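In practice, this kind of transfer often starts from a Kinetics-pretrained video backbone whose stem and head are adapted to single-channel CT input. A minimal sketch using torchvision's r3d_18 (a standard fine-tuning recipe, not the paper's exact setup; the single-channel stem and two-class head are placeholders):

```python
# Transfer a Kinetics-400 pretrained video backbone to a CT task.
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)

# CT volumes are single-channel, so the RGB stem is replaced (its pretrained
# weights are dropped here; averaging them across channels is an alternative).
model.stem[0] = nn.Conv3d(1, 64, kernel_size=(3, 7, 7),
                          stride=(1, 2, 2), padding=(1, 3, 3), bias=False)

# New task head, e.g. a binary chest CT finding.
model.fc = nn.Linear(model.fc.in_features, 2)
```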
- ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with a small amount of annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
arXiv Detail & Related papers (2022-12-10T01:34:47Z)
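Pretraining with object triplets typically means anchoring a trainable point-cloud encoder to frozen image and text embedding spaces with pairwise contrastive terms. A generic sketch of that objective (not the released ULIP code; the temperature and equal loss weighting are assumptions):

```python
# Triplet alignment: pull point-cloud embeddings toward frozen image and
# text embedding spaces with two pairwise InfoNCE terms.
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / tau
    y = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, y) + F.cross_entropy(logits.T, y))

def ulip_loss(pc_emb, img_emb, txt_emb):
    # img_emb and txt_emb come from frozen pretrained encoders; only the
    # point-cloud encoder receives gradients through pc_emb.
    return info_nce(pc_emb, img_emb) + info_nce(pc_emb, txt_emb)
```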
- Automated Model Design and Benchmarking of 3D Deep Learning Models for COVID-19 Detection with Chest CT Scans [72.04652116817238]
We propose a differentiable neural architecture search (DNAS) framework to automatically search for 3D DL models for 3D chest CT scan classification.
We also apply the Class Activation Mapping (CAM) technique to our models to provide interpretability of the results.
arXiv Detail & Related papers (2021-01-14T03:45:01Z)
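CAM itself is a simple construction: weight the last convolutional feature maps by the classifier weights for the target class. A sketch of the classic formulation extended to 3D (assumes a global-average-pooling plus linear head; the interface names here are illustrative):

```python
# Class Activation Mapping for a 3D classifier with a GAP + linear head.
import torch
import torch.nn.functional as F

def cam_3d(features, fc_weight, class_idx):
    # features:  (B, C, D, H, W) activations of the final conv layer
    # fc_weight: (num_classes, C) weights of the linear layer after pooling
    w = fc_weight[class_idx].view(1, -1, 1, 1, 1)
    cam = F.relu((w * features).sum(dim=1, keepdim=True))   # (B, 1, D, H, W)
    cam = cam / (cam.amax(dim=(2, 3, 4), keepdim=True) + 1e-8)
    return cam  # upsample to the scan resolution to overlay on slices
```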