A Brain Tumor Segmentation Method Based on CLIP and 3D U-Net with Cross-Modal Semantic Guidance and Multi-Level Feature Fusion
- URL: http://arxiv.org/abs/2507.09966v2
- Date: Thu, 17 Jul 2025 09:49:45 GMT
- Title: A Brain Tumor Segmentation Method Based on CLIP and 3D U-Net with Cross-Modal Semantic Guidance and Multi-Level Feature Fusion
- Authors: Mingda Zhang
- Abstract summary: This research presents a multi-level fusion architecture that integrates pixel-level, feature-level, and semantic-level information. The proposed model achieves an overall Dice coefficient of 0.8567, representing a 4.8% improvement compared to traditional 3D U-Net.
- Score: 7.784274233237623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Precise segmentation of brain tumors from magnetic resonance imaging (MRI) is essential for neuro-oncology diagnosis and treatment planning. Despite advances in deep learning methods, automatic segmentation remains challenging due to tumor morphological heterogeneity and complex three-dimensional spatial relationships. Current techniques primarily rely on visual features extracted from MRI sequences while underutilizing semantic knowledge embedded in medical reports. This research presents a multi-level fusion architecture that integrates pixel-level, feature-level, and semantic-level information, facilitating comprehensive processing from low-level data to high-level concepts. The semantic-level fusion pathway combines the semantic understanding capabilities of Contrastive Language-Image Pre-training (CLIP) models with the spatial feature extraction advantages of 3D U-Net through three mechanisms: 3D-2D semantic bridging, cross-modal semantic guidance, and semantic-based attention mechanisms. Experimental validation on the BraTS 2020 dataset demonstrates that the proposed model achieves an overall Dice coefficient of 0.8567, representing a 4.8% improvement compared to traditional 3D U-Net, with a 7.3% Dice coefficient increase in the clinically important enhancing tumor (ET) region.
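No code accompanies this page, so the following PyTorch module is only a minimal sketch of what the semantic-level fusion pathway might look like: a CLIP embedding (after the 3D-2D bridging step, which is omitted here) gates 3D U-Net feature maps channel-wise. All names, shapes, and the gating design are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SemanticGuidedAttention3D(nn.Module):
    """Sketch: modulate 3D U-Net features with a CLIP embedding via
    channel-wise gating. Shapes and names are assumptions, not the
    paper's actual module."""

    def __init__(self, feat_channels: int, clip_dim: int = 512):
        super().__init__()
        # Project the CLIP semantic vector into the feature channel space.
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, feat_channels),
            nn.ReLU(inplace=True),
            nn.Linear(feat_channels, feat_channels),
        )
        self.gate = nn.Sigmoid()

    def forward(self, feats: torch.Tensor, clip_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, D, H, W) volumetric features from the 3D U-Net encoder
        # clip_emb: (B, clip_dim) embedding from a frozen CLIP model
        weights = self.gate(self.proj(clip_emb))         # (B, C) channel weights
        weights = weights.view(*weights.shape, 1, 1, 1)  # broadcast over D, H, W
        return feats + feats * weights                   # residual semantic gating

# Toy usage with random tensors standing in for real encoder outputs.
block = SemanticGuidedAttention3D(feat_channels=64)
out = block(torch.randn(2, 64, 16, 32, 32), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 64, 16, 32, 32])
```

In a full model, `clip_emb` would come from a frozen CLIP encoder applied to the report text or bridged 2D slices, and the gated features would feed the decoder's skip connections.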
Related papers
- Graph-based Multi-Modal Interaction Lightweight Network for Brain Tumor Segmentation (GMLN-BTS) in Edge Iterative MRI Lesion Localization System (EdgeIMLocSys) [6.451534509235736]
We propose the Edge Iterative MRI Lesion Localization System (EdgeIMLocSys), which integrates Continuous Learning from Human Feedback. Central to this system is the Graph-based Multi-Modal Interaction Lightweight Network for Brain Tumor Segmentation (GMLN-BTS). Our proposed GMLN-BTS model achieves a Dice score of 85.1% on the BraTS 2017 dataset with only 4.58 million parameters, representing a 98% reduction compared to mainstream 3D Transformer models.
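The summary gives no architectural detail, so the following is only a generic illustration of graph-based interaction across MRI modalities: each modality contributes one node, and a learned, row-normalized adjacency mixes them. Names and sizes are hypothetical, not the GMLN-BTS design.

```python
import torch
import torch.nn as nn

class ModalityGraphInteraction(nn.Module):
    """Generic sketch of graph-based interaction between MRI modalities
    (e.g., T1, T1ce, T2, FLAIR); not the GMLN-BTS implementation."""

    def __init__(self, dim: int, num_modalities: int = 4):
        super().__init__()
        # Learnable adjacency over modality nodes, softmax-normalized per row.
        self.adj_logits = nn.Parameter(torch.zeros(num_modalities, num_modalities))
        self.update = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (B, M, dim), one pooled feature vector per modality
        adj = torch.softmax(self.adj_logits, dim=-1)       # (M, M)
        messages = torch.einsum("mn,bnd->bmd", adj, nodes) # mix modality nodes
        return nodes + torch.relu(self.update(messages))   # residual update

gi = ModalityGraphInteraction(dim=128)
y = gi(torch.randn(2, 4, 128))
print(y.shape)  # torch.Size([2, 4, 128])
```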
arXiv Detail & Related papers (2025-07-14T07:29:49Z)
- A Multi-Modal Fusion Framework for Brain Tumor Segmentation Based on 3D Spatial-Language-Vision Integration and Bidirectional Interactive Attention Mechanism [6.589206192038366]
The framework was evaluated on the BraTS 2020 dataset comprising 369 multi-institutional MRI scans. The proposed method achieved an average Dice coefficient of 0.8505 and a 95% Hausdorff distance of 2.8256 mm across enhancing tumor, tumor core, and whole tumor regions.
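Dice and 95% Hausdorff distance are the standard BraTS metrics quoted throughout this list; for reference, a minimal NumPy Dice implementation over binary masks (not drawn from any of the listed papers):

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|) for binary masks of any shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

# Toy example: two overlapping 3D masks.
p = np.zeros((4, 4, 4)); p[1:3, 1:3, 1:3] = 1
t = np.zeros((4, 4, 4)); t[1:4, 1:3, 1:3] = 1
print(round(dice_coefficient(p, t), 4))  # 0.8 (8 voxels overlap; |P|=8, |T|=12)
```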
arXiv Detail & Related papers (2025-07-11T13:21:56Z)
- Ensemble Learning and 3D Pix2Pix for Comprehensive Brain Tumor Analysis in Multimodal MRI [2.104687387907779]
This study presents an integrated approach leveraging the strengths of ensemble learning with hybrid transformer models and convolutional neural networks (CNNs). Our methodology combines robust tumor segmentation capabilities, utilizing axial attention and transformer encoders for enhanced spatial relationship modeling. The results demonstrate outstanding performance, evidenced by quantitative evaluations such as the Dice Similarity Coefficient (DSC) and Hausdorff Distance (HD95) for segmentation, and the Structural Similarity Index Measure (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Mean-Square Error (MSE) for inpainting.
arXiv Detail & Related papers (2024-12-16T15:10:53Z)
- 3D-CT-GPT: Generating 3D Radiology Reports through Integration of Large Vision-Language Models [51.855377054763345]
This paper introduces 3D-CT-GPT, a Visual Question Answering (VQA)-based medical visual language model for generating radiology reports from 3D CT scans.
Experiments on both public and private datasets demonstrate that 3D-CT-GPT significantly outperforms existing methods in terms of report accuracy and quality.
arXiv Detail & Related papers (2024-09-28T12:31:07Z)
- Learning Brain Tumor Representation in 3D High-Resolution MR Images via Interpretable State Space Models [42.55786269051626]
We propose a novel state-space-model (SSM)-based masked autoencoder which scales ViT-like models to handle high-resolution data effectively.
We propose a latent-to-spatial mapping technique that enables direct visualization of how latent features correspond to specific regions in the input volumes.
Our results highlight the potential of SSM-based self-supervised learning to transform radiomics analysis by combining efficiency and interpretability.
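The SSM encoder itself is too involved for a short sketch, but the masked-autoencoder setup it plugs into is easy to show: the volume is cut into non-overlapping 3D patches and a random subset is hidden before encoding. Patch size and mask ratio below are arbitrary assumptions, not the paper's settings.

```python
import torch

def random_patch_mask(volume: torch.Tensor, patch: int = 8, mask_ratio: float = 0.75):
    """Cut a (B, C, D, H, W) volume into non-overlapping 3D patches and
    randomly hide a fraction of them, MAE-style. Illustrative only."""
    B, C, D, H, W = volume.shape
    patches = (volume.unfold(2, patch, patch)
                     .unfold(3, patch, patch)
                     .unfold(4, patch, patch))           # (B, C, d, h, w, p, p, p)
    patches = patches.reshape(B, C, -1, patch, patch, patch)
    num_patches = patches.shape[2]
    patches = patches.permute(0, 2, 1, 3, 4, 5).reshape(B, num_patches, -1)
    num_keep = int(num_patches * (1 - mask_ratio))
    # Per-sample random permutation; keep the first num_keep patch indices.
    keep_idx = torch.rand(B, num_patches).argsort(dim=1)[:, :num_keep]
    gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1])
    visible = torch.gather(patches, 1, gather_idx)
    return visible, keep_idx  # `visible` feeds the encoder (SSM blocks in the paper)

vis, idx = random_patch_mask(torch.randn(2, 1, 32, 32, 32))
print(vis.shape)  # torch.Size([2, 16, 512]): 16 of 64 patches kept
```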
arXiv Detail & Related papers (2024-09-12T04:36:50Z)
- Brain Tumor Segmentation in MRI Images with 3D U-Net and Contextual Transformer [0.5033155053523042]
This research presents an enhanced approach for precise segmentation of brain tumor masses in magnetic resonance imaging (MRI) using an advanced 3D U-Net model combined with a Context Transformer (CoT).
The proposed model synchronizes tumor-mass characteristics extracted by the CoT with those of the U-Net, mutually reinforcing feature extraction and enabling precise capture of detailed tumor structures.
Experimental results demonstrate the outstanding segmentation performance of the proposed method in comparison with current state-of-the-art approaches.
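As a rough illustration of the CoT idea, below is a simplified 3D adaptation of a Contextual Transformer block: a statically convolved key context is combined with dynamic attention weights predicted from keys and queries. This is a sketch of the general CoT pattern, not the paper's exact module.

```python
import torch
import torch.nn as nn

class CoTBlock3D(nn.Module):
    """Simplified 3D Contextual Transformer (CoT) block sketch: static
    context from a 3x3x3 convolution over keys plus dynamic context from
    attention weights predicted from [keys; queries]."""

    def __init__(self, channels: int):
        super().__init__()
        self.key_embed = nn.Sequential(  # static local context
            nn.Conv3d(channels, channels, 3, padding=1, groups=4, bias=False),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
        )
        self.value_embed = nn.Conv3d(channels, channels, 1, bias=False)
        self.attn = nn.Sequential(       # map [K; Q] to per-voxel weights
            nn.Conv3d(2 * channels, channels, 1, bias=False),
            nn.BatchNorm3d(channels), nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k = self.key_embed(x)                                 # static context
        v = self.value_embed(x)
        a = torch.sigmoid(self.attn(torch.cat([k, x], dim=1)))
        return k + a * v                                      # static + dynamic

blk = CoTBlock3D(32)
print(blk(torch.randn(1, 32, 8, 16, 16)).shape)  # same shape as input
```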
arXiv Detail & Related papers (2024-07-11T13:04:20Z)
- SDR-Former: A Siamese Dual-Resolution Transformer for Liver Lesion Classification Using 3D Multi-Phase Imaging [59.78761085714715]
This study proposes a novel Siamese Dual-Resolution Transformer (SDR-Former) framework for liver lesion classification.
The proposed framework has been validated through comprehensive experiments on two clinical datasets.
To support the scientific community, we are releasing our extensive multi-phase MR dataset for liver lesion analysis to the public.
arXiv Detail & Related papers (2024-02-27T06:32:56Z)
- Cross-modality Guidance-aided Multi-modal Learning with Dual Attention for MRI Brain Tumor Grading [47.50733518140625]
Brain tumors are among the most fatal cancers worldwide and are very common in children and the elderly.
We propose a novel cross-modality guidance-aided multi-modal learning approach with dual attention to address the task of MRI brain tumor grading.
arXiv Detail & Related papers (2024-01-17T07:54:49Z)
- Automated Ensemble-Based Segmentation of Adult Brain Tumors: A Novel Approach Using the BraTS AFRICA Challenge Data [0.0]
We introduce an ensemble method that comprises eleven unique variations based on three core architectures.
Our findings reveal that the ensemble approach, combining different architectures, outperforms single models.
These results underline the potential of tailored deep learning techniques in precisely segmenting brain tumors.
arXiv Detail & Related papers (2023-08-14T15:34:22Z)
- Cross-Modality Deep Feature Learning for Brain Tumor Segmentation [158.8192041981564]
This paper proposes a novel cross-modality deep feature learning framework to segment brain tumors from the multi-modality MRI data.
The core idea is to mine rich patterns across the multi-modality data to compensate for the insufficient data scale.
Comprehensive experiments are conducted on the BraTS benchmarks, which show that the proposed cross-modality deep feature learning framework can effectively improve the brain tumor segmentation performance.
arXiv Detail & Related papers (2022-01-07T07:46:01Z)
- Revisiting 3D Context Modeling with Supervised Pre-training for Universal Lesion Detection in CT Slices [48.85784310158493]
We propose a Modified Pseudo-3D Feature Pyramid Network (MP3D FPN) to efficiently extract 3D context enhanced 2D features for universal lesion detection in CT slices.
With the novel pre-training method, the proposed MP3D FPN achieves state-of-the-art detection performance on the DeepLesion dataset.
The proposed 3D pre-trained weights can potentially be used to boost the performance of other 3D medical image analysis tasks.
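The "pseudo-3D" idea factorizes a full 3D convolution into a 2D in-plane convolution followed by a 1D cross-slice convolution, which is what lets 2D pre-trained weights initialize the in-plane part. A minimal sketch of such a factorized block (not the authors' MP3D FPN code):

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Factorized 3D convolution: a (1, 3, 3) in-plane conv followed by a
    (3, 1, 1) cross-slice conv. Illustrative of the pseudo-3D idea only."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, 3, 3), padding=(0, 1, 1), bias=False)
        self.axial = nn.Conv3d(out_ch, out_ch, (3, 1, 1), padding=(1, 0, 0), bias=False)
        self.norm = nn.BatchNorm3d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, slices, H, W); the spatial conv could be initialized from
        # 2D pre-trained weights by inserting a depth axis of size 1.
        return self.act(self.norm(self.axial(self.spatial(x))))

conv = Pseudo3DConv(1, 16)
print(conv(torch.randn(2, 1, 9, 64, 64)).shape)  # torch.Size([2, 16, 9, 64, 64])
```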
arXiv Detail & Related papers (2020-12-16T07:11:16Z)
- Inter-slice Context Residual Learning for 3D Medical Image Segmentation [38.43650000401734]
We propose the 3D context residual network (ConResNet) for the accurate segmentation of 3D medical images.
This model consists of an encoder, a segmentation decoder, and a context residual decoder.
We show that the proposed ConResNet is more accurate than six top-ranking methods in brain tumor segmentation and seven top-ranking methods in pancreas segmentation.
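One way to read the "context residual" supervising the second decoder is as the voxel-wise change of the segmentation mask between adjacent slices; the helper below derives such a target from a ground-truth mask. This is an interpretation for illustration, not the authors' code.

```python
import torch

def inter_slice_residual(mask: torch.Tensor) -> torch.Tensor:
    """Given a binary segmentation mask of shape (B, 1, D, H, W), return the
    absolute difference between adjacent slices, shape (B, 1, D-1, H, W).
    One reading of the 'context residual' target; illustrative only."""
    return (mask[:, :, 1:] - mask[:, :, :-1]).abs()

m = torch.zeros(1, 1, 4, 8, 8)
m[:, :, 1:3, 2:6, 2:6] = 1.0       # object present on slices 1 and 2
res = inter_slice_residual(m)       # nonzero where the mask appears/disappears
print(res.sum(dim=(1, 2, 3, 4)))    # tensor([32.]): 16 voxels change twice
```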
arXiv Detail & Related papers (2020-11-28T16:03:39Z)