SwinTF3D: A Lightweight Multimodal Fusion Approach for Text-Guided 3D Medical Image Segmentation
- URL: http://arxiv.org/abs/2512.22878v1
- Date: Sun, 28 Dec 2025 11:00:05 GMT
- Title: SwinTF3D: A Lightweight Multimodal Fusion Approach for Text-Guided 3D Medical Image Segmentation
- Authors: Hasan Faraz Khan, Noor Fatima, Muzammil Behzad
- Abstract summary: We propose SwinTF3D, a lightweight multimodal fusion approach that unifies visual and linguistic representations for text-guided 3D medical image segmentation. SwinTF3D achieves competitive Dice and IoU scores across multiple organs, despite its compact architecture.
- Score: 0.30586855806896035
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent integration of artificial intelligence into medical imaging has driven remarkable advances in automated organ segmentation. However, most existing 3D segmentation frameworks rely exclusively on visual learning from large annotated datasets, restricting their adaptability to new domains and clinical tasks. The lack of semantic understanding in these models makes them ineffective at addressing flexible, user-defined segmentation objectives. To overcome these limitations, we propose SwinTF3D, a lightweight multimodal fusion approach that unifies visual and linguistic representations for text-guided 3D medical image segmentation. The model employs a transformer-based visual encoder to extract volumetric features and integrates them with a compact text encoder via an efficient fusion mechanism. This design allows the system to understand natural-language prompts and correctly align semantic cues with their corresponding spatial structures in medical volumes, while producing accurate, context-aware segmentation results with low computational overhead. Extensive experiments on the BTCV dataset demonstrate that SwinTF3D achieves competitive Dice and IoU scores across multiple organs despite its compact architecture. The model generalizes well to unseen data and offers significant efficiency gains compared to conventional transformer-based segmentation networks. By bridging visual perception with linguistic understanding, SwinTF3D establishes a practical and interpretable paradigm for interactive, text-driven 3D medical image segmentation, opening the way to more adaptive and resource-efficient solutions in clinical imaging.
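The abstract sketches the architecture at a high level: volumetric features from a transformer-based visual encoder, a compact text encoder, and a fusion step that aligns prompt semantics with spatial structure. Below is a minimal PyTorch-style sketch of one plausible such fusion (cross-attention from voxel tokens to prompt tokens); the module names, dimensions, and residual fusion scheme are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch of text-guided fusion for 3D segmentation.
# All names, shapes, and design choices here are assumptions.
import torch
import torch.nn as nn

class TextGuidedFusion3D(nn.Module):
    def __init__(self, vis_dim=96, txt_dim=384, num_heads=4, num_classes=14):
        super().__init__()
        # Project compact text-encoder output into the visual feature space.
        self.txt_proj = nn.Linear(txt_dim, vis_dim)
        # Cross-attention: volumetric tokens attend to the prompt tokens.
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)
        # Voxel-wise classification head.
        self.head = nn.Conv3d(vis_dim, num_classes, kernel_size=1)

    def forward(self, vis_feats, txt_emb):
        # vis_feats: (B, C, D, H, W) from a Swin-style 3D visual encoder.
        # txt_emb:   (B, T, txt_dim) token embeddings of the text prompt.
        b, c, d, h, w = vis_feats.shape
        tokens = vis_feats.flatten(2).transpose(1, 2)       # (B, D*H*W, C)
        prompt = self.txt_proj(txt_emb)                     # (B, T, C)
        fused, _ = self.cross_attn(tokens, prompt, prompt)  # queries = voxels
        tokens = self.norm(tokens + fused)                  # residual fusion
        vol = tokens.transpose(1, 2).reshape(b, c, d, h, w)
        return self.head(vol)                               # (B, cls, D, H, W)
```

One appeal of this shape of design is that the attention cost scales with the number of prompt tokens, which is small, so a compact text encoder can be attached to a 3D backbone with little overhead.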
Related papers
- MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation [11.762545584252052]
We propose a unified 3D medical multimodal model that supports report generation, VQA, and multi-paradigm segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks.
arXiv Detail & Related papers (2026-01-14T21:21:00Z)
- TGC-Net: A Structure-Aware and Semantically-Aligned Framework for Text-Guided Medical Image Segmentation [56.09179939570486]
We propose TGC-Net, a CLIP-based framework focusing on parameter-efficient, task-specific adaptations. TGC-Net achieves state-of-the-art performance with substantially fewer trainable parameters, including notable Dice gains on challenging benchmarks.
arXiv Detail & Related papers (2025-12-24T12:06:26Z)
- VELVET-Med: Vision and Efficient Language Pre-training for Volumetric Imaging Tasks in Medicine [11.993301266706139]
We propose a vision-language pre-training framework, termed VELVET-Med, specifically designed for limited volumetric data such as 3D CT and associated radiology reports. Our approach seeks to uncover rich spatial and semantic relationships embedded in volumetric medical images and corresponding clinical narratives. The resulting encoders exhibit strong transferability, achieving state-of-the-art performance across a wide range of downstream tasks.
arXiv Detail & Related papers (2025-08-16T17:08:43Z)
- Text-driven Multiplanar Visual Interaction for Semi-supervised Medical Image Segmentation [48.76848912120607]
Semi-supervised medical image segmentation is a crucial technique for alleviating the high cost of data annotation. We propose a novel text-driven multiplanar visual interaction framework for semi-supervised medical image segmentation (termed Text-SemiSeg). Our framework consists of three main modules: Text-enhanced Multiplanar Representation (TMR), Category-aware Semantic Alignment (CSA), and Dynamic Cognitive Augmentation (DCA).
arXiv Detail & Related papers (2025-07-16T16:29:30Z)
- TK-Mamba: Marrying KAN with Mamba for Text-Driven 3D Medical Image Segmentation [22.62310549476759]
3D medical image segmentation is vital for clinical diagnosis and treatment. Traditional single-modality networks, such as CNNs and Transformers, are often limited by computational inefficiency and constrained contextual modeling. We introduce a novel multimodal framework that leverages Mamba and Kolmogorov-Arnold Networks (KAN) as an efficient backbone for long-sequence modeling.
arXiv Detail & Related papers (2025-05-24T05:41:55Z)
- TextDiffSeg: Text-guided Latent Diffusion Model for 3d Medical Images Segmentation [0.0]
The text-guided diffusion framework TextDiffSeg integrates 3D volumetric data with natural language descriptions. To enhance the model's ability to recognize complex anatomical structures, TextDiffSeg incorporates innovative label embedding techniques. Experimental results demonstrate that TextDiffSeg consistently outperforms existing methods in segmentation tasks involving kidney and pancreas tumors.
arXiv Detail & Related papers (2025-04-16T07:17:36Z)
- A Novel Convolutional-Free Method for 3D Medical Imaging Segmentation [0.0]
Convolutional neural networks (CNNs) have dominated the field, achieving significant success in 3D medical image segmentation. Recent transformer-based models, such as TransUNet and nnFormer, have shown promise in addressing CNNs' limitations. This paper introduces a novel, fully convolution-free model based on transformer architecture and self-attention mechanisms.
arXiv Detail & Related papers (2025-02-08T00:52:45Z)
- CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios [53.94122089629544]
We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning.
Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates that it can identify organs and abnormalities in a zero-shot manner using natural language.
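As a rough illustration of the organ-level contrastive objective described here, the following is a generic CLIP-style InfoNCE sketch over paired organ-crop and organ-text embeddings; the function name, temperature value, and symmetric form are assumptions, not CT-GLIP's actual implementation.

```python
# Generic symmetric InfoNCE loss for image-text contrastive pre-training
# (an illustrative sketch, not CT-GLIP's released code).
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (N, D) paired organ-crop and organ-text embeddings.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                  # (N, N) similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric loss: match each image to its text and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```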
arXiv Detail & Related papers (2024-04-23T17:59:01Z)
- Enhancing Weakly Supervised 3D Medical Image Segmentation through Probabilistic-aware Learning [47.700298779672366]
3D medical image segmentation is a challenging task with crucial implications for disease diagnosis and treatment planning. Recent advances in deep learning have significantly enhanced fully supervised medical image segmentation. We propose a novel probabilistic-aware weakly supervised learning pipeline, specifically designed for 3D medical imaging.
arXiv Detail & Related papers (2024-03-05T00:46:53Z)
- VolumeDiffusion: Flexible Text-to-3D Generation with Efficient Volumetric Encoder [56.59814904526965]
This paper introduces a pioneering 3D encoder designed for text-to-3D generation.
A lightweight network is developed to efficiently acquire feature volumes from multi-view images.
A diffusion model with a 3D U-Net backbone is then trained on these feature volumes for text-to-3D generation.
arXiv Detail & Related papers (2023-12-18T18:59:05Z)
- Disruptive Autoencoders: Leveraging Low-level features for 3D Medical Image Pre-training [51.16994853817024]
This work focuses on designing an effective pre-training framework for 3D radiology images.
We introduce Disruptive Autoencoders, a pre-training framework that attempts to reconstruct the original image from disruptions created by a combination of local masking and low-level perturbations.
The proposed pre-training framework is tested across multiple downstream tasks and achieves state-of-the-art performance.
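A toy version of the disrupt-and-reconstruct idea might look as follows: zero out random local patches of the volume, add low-level Gaussian noise, and train the autoencoder to recover the clean input. The masking ratio, patch size, and choice of perturbation here are illustrative assumptions; the paper's actual set of low-level perturbations is more specific.

```python
# Toy "disruptive" reconstruction objective for 3D volumes
# (illustrative assumptions, not the paper's exact recipe).
import torch
import torch.nn.functional as F

def disrupt(volume, mask_ratio=0.6, patch=8, noise_std=0.1):
    # volume: (B, 1, D, H, W), with D, H, W divisible by `patch`.
    # Zero out random local patches, then add low-level Gaussian noise.
    b, c, d, h, w = volume.shape
    keep = (torch.rand(b, 1, d // patch, h // patch, w // patch,
                       device=volume.device) > mask_ratio).float()
    mask = F.interpolate(keep, size=(d, h, w), mode="nearest")
    return volume * mask + noise_std * torch.randn_like(volume)

def reconstruction_loss(model, volume):
    # Train the model to recover the clean volume from its disruption.
    return F.mse_loss(model(disrupt(volume)), volume)
```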
arXiv Detail & Related papers (2023-07-31T17:59:42Z)
- Focused Decoding Enables 3D Anatomical Detection by Transformers [64.36530874341666]
We propose a novel Detection Transformer for 3D anatomical structure detection, dubbed Focused Decoder.
Focused Decoder leverages information from an anatomical region atlas to simultaneously deploy query anchors and restrict the cross-attention's field of view.
We evaluate our proposed approach on two publicly available CT datasets and demonstrate that Focused Decoder not only provides strong detection results, alleviating the need for vast amounts of annotated data, but also offers highly intuitive explainability via its attention weights.
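One way to picture the restricted field of view is cross-attention with a per-query boolean mask derived from the region atlas, as in the hypothetical sketch below; the class name and masking scheme are assumptions, not the paper's exact formulation.

```python
# Sketch of atlas-restricted ("focused") cross-attention: each query anchor
# may only attend to image tokens inside its region of interest.
# Names and the masking scheme are illustrative assumptions.
import torch
import torch.nn as nn

class FocusedCrossAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries, feats, region_mask):
        # queries:     (B, Q, D) anchor queries placed via a region atlas
        # feats:       (B, N, D) flattened volumetric feature tokens
        # region_mask: (B, Q, N) bool, True where a token lies OUTSIDE the
        #              query's region and must be hidden from attention
        mask = region_mask.repeat_interleave(self.num_heads, dim=0)
        out, weights = self.attn(queries, feats, feats, attn_mask=mask)
        return out, weights  # weights show where each anchor looked
```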
arXiv Detail & Related papers (2022-07-21T22:17:21Z)
- Dynamic Linear Transformer for 3D Biomedical Image Segmentation [2.440109381823186]
Transformer-based neural networks have shown promising performance on many biomedical image segmentation tasks. The main challenge for 3D transformer-based segmentation methods is the quadratic complexity introduced by the self-attention mechanism. We propose a novel transformer for 3D medical image segmentation that uses an encoder-decoder design with linear complexity.
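For context, a common way to obtain linear-complexity attention is the kernel feature-map trick, sketched below; this is a generic formulation (elu-based feature map, as in linear-transformer literature), not necessarily this paper's exact mechanism.

```python
# Generic linear attention via a positive kernel feature map, reducing
# self-attention from O(N^2) to O(N) in sequence length (illustrative).
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (B, N, D) with N = number of volumetric tokens.
    q = F.elu(q) + 1.0  # positive feature map phi(x) = elu(x) + 1
    k = F.elu(k) + 1.0
    # Accumulate key-value outer products once: (B, D, D), O(N) cost.
    kv = torch.einsum("bnd,bne->bde", k, v)
    # Per-query normalizer: phi(q_i) . sum_j phi(k_j).
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)
```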
arXiv Detail & Related papers (2022-06-01T21:15:01Z)