TGC-Net: A Structure-Aware and Semantically-Aligned Framework for Text-Guided Medical Image Segmentation
- URL: http://arxiv.org/abs/2512.21135v1
- Date: Wed, 24 Dec 2025 12:06:26 GMT
- Title: TGC-Net: A Structure-Aware and Semantically-Aligned Framework for Text-Guided Medical Image Segmentation
- Authors: Gaoren Lin, Huangxuan Zhao, Yuan Xiong, Lefei Zhang, Bo Du, Wentao Zhu
- Abstract summary: We propose TGC-Net, a CLIP-based framework focusing on parameter-efficient, task-specific adaptations. TGC-Net achieves state-of-the-art performance with substantially fewer trainable parameters, including notable Dice gains on challenging benchmarks.
- Score: 56.09179939570486
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-guided medical segmentation enhances segmentation accuracy by utilizing clinical reports as auxiliary information. However, existing methods typically rely on unaligned image and text encoders, which necessitate complex interaction modules for multimodal fusion. While CLIP provides a pre-aligned multimodal feature space, its direct application to medical imaging is limited by three main issues: insufficient preservation of fine-grained anatomical structures, inadequate modeling of complex clinical descriptions, and domain-specific semantic misalignment. To tackle these challenges, we propose TGC-Net, a CLIP-based framework focusing on parameter-efficient, task-specific adaptations. Specifically, it incorporates a Semantic-Structural Synergy Encoder (SSE) that augments CLIP's ViT with a CNN branch for multi-scale structural refinement, a Domain-Augmented Text Encoder (DATE) that injects large-language-model-derived medical knowledge, and a Vision-Language Calibration Module (VLCM) that refines cross-modal correspondence in a unified feature space. Experiments on five datasets across chest X-ray and thoracic CT modalities demonstrate that TGC-Net achieves state-of-the-art performance with substantially fewer trainable parameters, including notable Dice gains on challenging benchmarks.
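The cross-modal correspondence that the VLCM refines can be illustrated with a simplified sketch (not the paper's implementation): in a pre-aligned CLIP-like space, cosine similarity between each image-patch embedding and a text embedding already yields a coarse text-guided segmentation prior, which task-specific modules would then calibrate. All names below are illustrative.

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similarity_map(patch_feats, text_feat):
    """Coarse text-guided segmentation prior: one cosine score per patch.

    patch_feats: H x W grid (list of lists) of D-dim patch embeddings.
    text_feat:   D-dim text embedding in the same shared space.
    """
    return [[cosine(p, text_feat) for p in row] for row in patch_feats]

# toy 2x2 grid of 3-dim patch embeddings and one text embedding
grid = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
        [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]]
text = [1.0, 0.0, 0.0]
sim = similarity_map(grid, text)
# patches aligned with the text direction score highest
assert sim[0][0] == 1.0 and sim[1][1] == 0.0
```

In a real pipeline the patch grid would come from a ViT and the text embedding from a frozen text encoder; the point here is only that alignment in a shared space makes the fusion step a dot product rather than a learned interaction module.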
Related papers
- SwinTF3D: A Lightweight Multimodal Fusion Approach for Text-Guided 3D Medical Image Segmentation [0.30586855806896035]
We propose SwinTF3D, a lightweight multimodal fusion approach that unifies visual and linguistic representations for text-guided 3D medical image segmentation. SwinTF3D achieves competitive Dice and IoU scores across multiple organs, despite its compact architecture.
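Dice and IoU, the overlap metrics cited throughout these papers, can be computed for flat binary masks with a short generic snippet (an illustrative sketch, not code from any of the listed works):

```python
def dice_and_iou(pred, gt):
    """Dice and IoU for flat binary masks (equal-length lists of 0/1)."""
    inter = sum(p & g for p, g in zip(pred, gt))   # true positives
    ps, gs = sum(pred), sum(gt)                    # mask sizes
    union = ps + gs - inter
    dice = 2 * inter / (ps + gs) if (ps + gs) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou

pred = [1, 1, 0, 0, 1]
gt   = [1, 0, 0, 1, 1]
d, i = dice_and_iou(pred, gt)
# inter = 2, |pred| = 3, |gt| = 3: Dice = 4/6, IoU = 2/4
assert abs(d - 2 / 3) < 1e-9 and abs(i - 0.5) < 1e-9
```

Dice weights the intersection twice, so it is always at least as large as IoU on the same masks; both conventionally return 1.0 when prediction and ground truth are both empty.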
arXiv Detail & Related papers (2025-12-28T11:00:05Z) - DTEA: Dynamic Topology Weaving and Instability-Driven Entropic Attenuation for Medical Image Segmentation [31.50032207382483]
Skip connections are used to merge global context and reduce the semantic gap between the encoder and decoder. We propose the DTEA model, featuring a new skip-connection framework with Semantic Topology Reconfiguration (STR) and Entropic Perturbation Gating (EPG) modules.
arXiv Detail & Related papers (2025-10-13T10:50:41Z) - Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation [56.52520416420957]
We propose Multimodal Causal-Driven Representation Learning (MCDRL) to tackle domain generalization in medical image segmentation. MCDRL consistently outperforms competing methods, yielding superior segmentation accuracy and exhibiting robust generalizability.
arXiv Detail & Related papers (2025-08-07T03:41:41Z) - CDPDNet: Integrating Text Guidance with Hybrid Vision Encoders for Medical Image Segmentation [8.56773843063124]
Most publicly available medical segmentation datasets are only partially labeled. In this study, we propose a novel CLIP-DINO Prompt-Driven Network (CDPDNet) that combines a self-supervised vision transformer with CLIP-based text embedding and introduces task-specific text prompts to tackle these challenges.
arXiv Detail & Related papers (2025-05-25T03:23:58Z) - CENet: Context Enhancement Network for Medical Image Segmentation [3.4690322157094573]
We propose the Context Enhancement Network (CENet), a novel segmentation framework featuring two key innovations. First, the Dual Selective Enhancement Block (DSEB) integrated into skip connections enhances boundary details and improves the detection of smaller organs in a context-aware manner. Second, the Context Feature Attention Module (CFAM) in the decoder employs a multi-scale design to maintain spatial integrity, reduce feature redundancy, and mitigate overly enhanced representations.
arXiv Detail & Related papers (2025-05-23T23:22:18Z) - MulModSeg: Enhancing Unpaired Multi-Modal Medical Image Segmentation with Modality-Conditioned Text Embedding and Alternating Training [10.558275557142137]
We propose a simple Multi-Modal (MulModSeg) strategy to enhance medical image segmentation across multiple modalities.
MulModSeg incorporates a modality-conditioned text embedding framework via a frozen text encoder.
It consistently outperforms previous methods in segmenting abdominal multi-organ and cardiac substructures for both CT and MR.
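The modality-conditioned text embedding idea can be sketched as building a different prompt per imaging modality, which a frozen text encoder would then embed once and cache. The template below is a hypothetical illustration, not MulModSeg's actual prompt format:

```python
def modality_prompt(organ: str, modality: str) -> str:
    """Hypothetical modality-conditioned prompt template: the same
    organ name yields a different text input per imaging modality,
    so a frozen text encoder produces modality-specific embeddings."""
    return f"A {modality} scan showing the {organ}."

# one prompt per (organ, modality) pair; in practice their frozen-text-
# encoder embeddings would be precomputed and cached
prompts = {(o, m): modality_prompt(o, m)
           for o in ("liver", "heart") for m in ("CT", "MR")}
assert prompts[("liver", "CT")] == "A CT scan showing the liver."
assert prompts[("liver", "CT")] != prompts[("liver", "MR")]
```

Because the text encoder stays frozen, conditioning lives entirely in the prompt string, which keeps the number of trainable parameters unchanged across modalities.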
arXiv Detail & Related papers (2024-11-23T14:37:01Z) - Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation [50.433911327489554]
The goal of referring remote sensing image segmentation (RRSIS) is to generate a pixel-level mask of the target object identified by the referring expression. To address the aforementioned challenges, a novel RRSIS framework is proposed, termed the cross-modal bidirectional interaction model (CroBIM). To further foster research on RRSIS, we also construct RISBench, a new large-scale benchmark dataset comprising 52,472 image-language-label triplets.
arXiv Detail & Related papers (2024-10-11T08:28:04Z) - BCS-Net: Boundary, Context and Semantic for Automatic COVID-19 Lung Infection Segmentation from CT Images [83.82141604007899]
BCS-Net is a novel network for automatic COVID-19 lung infection segmentation from CT images.
BCS-Net follows an encoder-decoder architecture, and more designs focus on the decoder stage.
In each BCSR block, the attention-guided global context (AGGC) module is designed to learn the most valuable encoder features for the decoder.
arXiv Detail & Related papers (2022-07-17T08:54:07Z) - Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z) - TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation [78.01570371790669]
Medical image segmentation is an essential prerequisite for developing healthcare systems.
On various medical image segmentation tasks, the u-shaped architecture, also known as U-Net, has become the de facto standard.
We propose TransUNet, which merits both Transformers and U-Net, as a strong alternative for medical image segmentation.
arXiv Detail & Related papers (2021-02-08T16:10:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.