DualProtoSeg: Simple and Efficient Design with Text- and Image-Guided Prototype Learning for Weakly Supervised Histopathology Image Segmentation
- URL: http://arxiv.org/abs/2512.10314v1
- Date: Thu, 11 Dec 2025 06:03:28 GMT
- Title: DualProtoSeg: Simple and Efficient Design with Text- and Image-Guided Prototype Learning for Weakly Supervised Histopathology Image Segmentation
- Authors: Anh M. Vu, Khang P. Le, Trang T. K. Vo, Ha Thach, Huy Hung Nguyen, David Yang, Han H. Huynh, Quynh Nguyen, Tuan M. Pham, Tuan-Anh Le, Minh H. N. Le, Thanh-Huy Nguyen, Akash Awasthi, Chandra Mohan, Zhu Han, Hien Van Nguyen
- Abstract summary: We propose a prototype-driven framework that leverages vision-language alignment to improve region discovery under weak supervision. Our method integrates CoOp-style learnable prompt tuning to generate text-based prototypes and combines them with learnable image prototypes, forming a dual-modal prototype bank.
- Score: 19.307501518696622
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised semantic segmentation (WSSS) in histopathology seeks to reduce annotation cost by learning from image-level labels, yet it remains limited by inter-class homogeneity, intra-class heterogeneity, and the region-shrinkage effect of CAM-based supervision. We propose a simple and effective prototype-driven framework that leverages vision-language alignment to improve region discovery under weak supervision. Our method integrates CoOp-style learnable prompt tuning to generate text-based prototypes and combines them with learnable image prototypes, forming a dual-modal prototype bank that captures both semantic and appearance cues. To address oversmoothing in ViT representations, we incorporate a multi-scale pyramid module that enhances spatial precision and improves localization quality. Experiments on the BCSS-WSSS benchmark show that our approach surpasses existing state-of-the-art methods, and detailed analyses demonstrate the benefits of text description diversity, context length, and the complementary behavior of text and image prototypes. These results highlight the effectiveness of jointly leveraging textual semantics and visual prototype learning for WSSS in digital pathology.
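To make the dual-modal prototype idea concrete, below is a minimal PyTorch sketch of a prototype bank that pairs CoOp-style learnable prompt contexts with learnable image prototypes and scores ViT patch tokens against both. All names (`DualPrototypeBank`, `n_ctx`, `n_img_protos`) and the stub text encoder are assumptions for illustration; the paper uses a frozen CLIP text encoder and adds a multi-scale pyramid module on top, neither of which is reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualPrototypeBank(nn.Module):
    """Scores ViT patch tokens against text- and image-derived prototypes."""

    def __init__(self, num_classes, embed_dim=512, n_ctx=16, n_img_protos=4):
        super().__init__()
        # CoOp-style learnable context vectors, shared across classes and
        # prepended to each class-name token embedding.
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        # Frozen class-name token embeddings (stand-in for CLIP's tokenizer
        # and token-embedding table applied to each class name).
        self.register_buffer("cls_tokens", torch.randn(num_classes, 1, embed_dim))
        # Learnable image prototypes: several per class, to capture
        # intra-class appearance variation.
        self.img_protos = nn.Parameter(
            torch.randn(num_classes, n_img_protos, embed_dim) * 0.02
        )
        # Stand-in for a frozen CLIP text encoder; in the paper this would
        # be CLIP's text transformer run over the full prompt sequence.
        self.text_encoder = nn.Linear(embed_dim, embed_dim)
        for p in self.text_encoder.parameters():
            p.requires_grad = False

    def text_prototypes(self):
        # Build one prompt per class: [ctx_1 ... ctx_n, CLASS] -> embedding.
        n_cls = self.cls_tokens.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)      # (C, n_ctx, D)
        prompts = torch.cat([ctx, self.cls_tokens], dim=1)     # (C, n_ctx+1, D)
        return self.text_encoder(prompts.mean(dim=1))          # (C, D)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, D) ViT patch embeddings.
        t = F.normalize(self.text_prototypes(), dim=-1)        # (C, D)
        v = F.normalize(self.img_protos, dim=-1)               # (C, K, D)
        x = F.normalize(patch_tokens, dim=-1)                  # (B, N, D)
        text_sim = torch.einsum("bnd,cd->bnc", x, t)           # semantic cue
        img_sim = torch.einsum("bnd,ckd->bnck", x, v).amax(-1) # appearance cue
        return 0.5 * (text_sim + img_sim)                      # (B, N, C)
```

Pooling the resulting (B, N, C) patch scores over the patch dimension (e.g., max pooling) yields image-level logits trainable from image-level labels alone, while reshaping the same scores onto the patch grid gives CAM-like localization maps.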
Related papers
- Multimodal Prototype Alignment for Semi-supervised Pathology Image Segmentation [9.790130257265217]
MPAMatch is a novel segmentation framework that performs pixel-level contrastive learning under a multimodal prototype-guided supervision paradigm. The core innovation of MPAMatch lies in the dual contrastive learning scheme between image prototypes and pixel labels, and between text prototypes and pixel labels. In addition, we reconstruct the classic segmentation architecture (TransUNet) by replacing its ViT backbone with a pathology-pretrained foundation model (Uni).
arXiv Detail & Related papers (2025-08-27T05:15:13Z) - BiPVL-Seg: Bidirectional Progressive Vision-Language Fusion with Global-Local Alignment for Medical Image Segmentation [9.262045402495225]
BiPVL-Seg is an end-to-end framework that integrates vision-language fusion and embedding alignment. BiPVL-Seg introduces progressive fusion in the architecture, which facilitates stage-wise information exchange between vision and text encoders. It incorporates global-local contrastive alignment, a training objective that enhances the text encoder's comprehension by aligning text and vision embeddings at both class and concept levels.
arXiv Detail & Related papers (2025-03-30T17:34:39Z) - Prototype-Based Image Prompting for Weakly Supervised Histopathological Image Segmentation [13.640757848445835]
Weakly supervised image segmentation with image-level labels has drawn attention due to the high cost of pixel-level annotations. Traditional methods using Class Activation Maps (CAMs) often highlight only the most discriminative regions.
arXiv Detail & Related papers (2025-03-15T09:55:31Z) - ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
The multiple instance learning (MIL)-based framework has become the mainstream approach for processing whole slide images (WSIs). We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z) - Toward Modality Gap: Vision Prototype Learning for Weakly-supervised Semantic Segmentation with CLIP [19.697857943845012]
We propose a framework to learn class-specific vision prototypes in vision space with the help of text prototypes. We also propose a regional semantic contrast module that contrasts region embeddings with their corresponding prototypes. Our proposed framework achieves state-of-the-art performance on two benchmark datasets.
arXiv Detail & Related papers (2024-12-27T13:55:11Z) - A Multimodal Approach Combining Structural and Cross-domain Textual Guidance for Weakly Supervised OCT Segmentation [12.948027961485536]
We propose a novel Weakly Supervised Semantic Segmentation (WSSS) approach that integrates structural guidance with text-driven strategies to generate high-quality pseudo labels.
Our method achieves state-of-the-art performance, highlighting its potential to improve diagnostic accuracy and efficiency in medical imaging.
arXiv Detail & Related papers (2024-11-19T16:20:27Z) - Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings.
We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features.
Our approach achieves favorable performance against existing methods in the literature.
arXiv Detail & Related papers (2024-04-01T17:48:15Z) - Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment (a minimal sketch of this idea follows after this list).
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - TediGAN: Text-Guided Diverse Face Image Generation and Manipulation [52.83401421019309]
TediGAN is a framework for multi-modal image generation and manipulation with textual descriptions.
A StyleGAN inversion module maps real images to the latent space of a well-trained StyleGAN.
Visual-linguistic similarity learning learns text-image matching by mapping the image and text into a common embedding space.
Instance-level optimization is used for identity preservation in manipulation.
arXiv Detail & Related papers (2020-12-06T16:20:19Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore the semantics available in captions and leverage them to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
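As referenced in the CRIS entry above, several of these works align text embeddings with pixel-level features. Below is a minimal sketch of one common form of text-to-pixel contrastive alignment, assuming normalized dense features and a binary foreground mask; the function name, the BCE formulation, and the temperature value are illustrative assumptions, not any single paper's exact objective.

```python
import torch
import torch.nn.functional as F


def text_to_pixel_loss(pixel_feats, text_emb, mask, temperature=0.07):
    """pixel_feats: (B, D, H, W) dense visual features; text_emb: (B, D)
    sentence embedding; mask: (B, H, W) binary foreground mask."""
    x = F.normalize(pixel_feats, dim=1)
    t = F.normalize(text_emb, dim=1)
    # Cosine similarity between the text embedding and every pixel feature.
    logits = torch.einsum("bdhw,bd->bhw", x, t) / temperature
    # Pull the text embedding toward foreground pixels, push it away
    # from background pixels.
    return F.binary_cross_entropy_with_logits(logits, mask.float())
```

In a referring-segmentation setting such as CRIS the mask comes from ground truth; in weakly supervised pipelines like those above, it would typically be a CAM-derived pseudo mask.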
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.