Related papers: Context Patch Fusion With Class Token Enhancement for Weakly Supervised Semantic Segmentation

URL: http://arxiv.org/abs/2601.14718v1
Date: Wed, 21 Jan 2026 07:12:23 GMT
Title: Context Patch Fusion With Class Token Enhancement for Weakly Supervised Semantic Segmentation
Authors: Yiyang Fu, Hui Li, Wangyu Wu,
Abstract summary: We propose the Context Patch Fusion with Class Token Enhancement (CPF-CTE) framework. CPF-CTE exploits contextual relations among patches to enrich feature representations and improve segmentation. We show that CPF-CTE consistently surpasses prior WSSS methods.
Score: 5.859471574310312
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Weakly Supervised Semantic Segmentation (WSSS), which relies only on image-level labels, has attracted significant attention for its cost-effectiveness and scalability. Existing methods mainly enhance inter-class distinctions and employ data augmentation to mitigate semantic ambiguity and reduce spurious activations. However, they often neglect the complex contextual dependencies among image patches, resulting in incomplete local representations and limited segmentation accuracy. To address these issues, we propose the Context Patch Fusion with Class Token Enhancement (CPF-CTE) framework, which exploits contextual relations among patches to enrich feature representations and improve segmentation. At its core, the Contextual-Fusion Bidirectional Long Short-Term Memory (CF-BiLSTM) module captures spatial dependencies between patches and enables bidirectional information flow, yielding a more comprehensive understanding of spatial correlations. This strengthens feature learning and segmentation robustness. Moreover, we introduce learnable class tokens that dynamically encode and refine class-specific semantics, enhancing discriminative capability. By effectively integrating spatial and semantic cues, CPF-CTE produces richer and more accurate representations of image content. Extensive experiments on PASCAL VOC 2012 and MS COCO 2014 validate that CPF-CTE consistently surpasses prior WSSS methods.

Related papers

Multi-Text Guided Few-Shot Semantic Segmentation [17.27158303776253]
We propose the Multi-Text Guided Few-Shot Semantic Network (MTGNet) to enhance segmentation performance. MTGNet fuses diverse textual prompts to refine textual priors and guide the cross-modal optimization of visual priors. It achieves 76.8% mIoU on PASCAL-5i and 57.4% on COCO-20i, with notable improvements in folds exhibiting high intra-class variations.
arXiv Detail & Related papers (2025-11-19T15:09:19Z)
LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance [54.683384204063934]
Large multi-modal models (LMMs) struggle with inaccurate segmentation and hallucinated comprehension. We propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation. LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks.
arXiv Detail & Related papers (2025-07-08T07:46:26Z)
FA-Seg: A Fast and Accurate Diffusion-Based Method for Open-Vocabulary Segmentation [1.4525238046020867]
Open-vocabulary semantic segmentation aims to segment objects from arbitrary text categories without requiring densely annotated datasets. We present FA-Seg, a training-free framework for open-vocabulary segmentation based on diffusion models.
arXiv Detail & Related papers (2025-06-29T16:41:41Z)
Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning [81.02648336552421]
We propose a Multi-Constraint Consistency Learning approach to facilitate the staged enhancement of the encoder and decoder. Self-adaptive feature masking and noise injection are designed in an instance-specific manner to perturb the features for robust learning of the decoder. Experimental results on Pascal VOC2012 and Cityscapes datasets demonstrate that our proposed MCCL achieves new state-of-the-art performance.
arXiv Detail & Related papers (2025-03-23T03:21:33Z)
Uncertainty-Participation Context Consistency Learning for Semi-supervised Semantic Segmentation [9.546065701435532]
Semi-supervised semantic segmentation has attracted considerable attention for its ability to mitigate the reliance on extensive labeled data. This paper proposes the Uncertainty-participation Context Consistency Learning (UCCL) method to explore richer supervisory signals.
arXiv Detail & Related papers (2024-12-23T06:49:59Z)
DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation. We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity. In particular, our approach allows for a more efficient end-to-end process as a single-stage method.
arXiv Detail & Related papers (2024-09-24T06:51:49Z)
Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised Semantic Segmentation [79.05949524349005]
We propose AuxSegNet+, a weakly supervised auxiliary learning framework to explore the rich information from saliency maps. We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps.
arXiv Detail & Related papers (2024-03-02T10:03:21Z)
FECANet: Boosting Few-Shot Semantic Segmentation with Feature-Enhanced Context-Aware Network [48.912196729711624]
Few-shot semantic segmentation is the task of learning to locate each pixel of a novel class in a query image with only a few annotated support images. We propose a Feature-Enhanced Context-Aware Network (FECANet) to suppress the matching noise caused by inter-class local similarity. In addition, we propose a novel correlation reconstruction module that encodes extra correspondence relations between foreground and background and multi-scale context semantic features.
arXiv Detail & Related papers (2023-01-19T16:31:13Z)
CTNet: Context-based Tandem Network for Semantic Segmentation [77.4337867789772]
This work proposes a novel Context-based Tandem Network (CTNet) by interactively exploring the spatial contextual information and the channel contextual information. To further improve the performance of the learned representations for semantic segmentation, the results of the two context modules are adaptively integrated.
arXiv Detail & Related papers (2021-04-20T07:33:11Z)
Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression. Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities. We propose a Cross-Modal Progressive (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.