MoRe: Class Patch Attention Needs Regularization for Weakly Supervised Semantic Segmentation
- URL: http://arxiv.org/abs/2412.11076v3
- Date: Fri, 17 Jan 2025 07:21:34 GMT
- Title: MoRe: Class Patch Attention Needs Regularization for Weakly Supervised Semantic Segmentation
- Authors: Zhiwei Yang, Yucong Meng, Kexue Fu, Shuo Wang, Zhijian Song,
- Abstract summary: We propose MoRe to address the issue of falsely activating patch regions with minimal semantic relevance.
Our findings suggest that imposing additional regularization on class-patch attention is necessary.
MoRe effectively addresses the artifact issue and achieves state-of-the-art performance, surpassing recent single-stage and even multi-stage methods.
- Score: 12.308473939796945
- License:
- Abstract: Weakly Supervised Semantic Segmentation (WSSS) with image-level labels typically uses Class Activation Maps (CAM) to achieve dense predictions. Recently, Vision Transformer (ViT) has provided an alternative to generate localization maps from class-patch attention. However, due to insufficient constraints on modeling such attention, we observe that the Localization Attention Maps (LAM) often struggle with the artifact issue, i.e., patch regions with minimal semantic relevance are falsely activated by class tokens. In this work, we propose MoRe to address this issue and further explore the potential of LAM. Our findings suggest that imposing additional regularization on class-patch attention is necessary. To this end, we first view the attention as a novel directed graph and propose the Graph Category Representation module to implicitly regularize the interaction among class-patch entities. It ensures that class tokens dynamically condense the related patch information and suppress unrelated artifacts at a graph level. Second, motivated by the observation that CAM from classification weights maintains smooth localization of objects, we devise the Localization-informed Regularization module to explicitly regularize the class-patch attention. It directly mines the token relations from CAM and further supervises the consistency between class and patch tokens in a learnable manner. Extensive experiments are conducted on PASCAL VOC and MS COCO, validating that MoRe effectively addresses the artifact issue and achieves state-of-the-art performance, surpassing recent single-stage and even multi-stage methods. Code is available at https://github.com/zwyang6/MoRe.
Related papers
- Spatial Action Unit Cues for Interpretable Deep Facial Expression Recognition [55.97779732051921]
State-of-the-art classifiers for facial expression recognition (FER) lack interpretability, an important feature for end-users.
A new learning strategy is proposed to explicitly incorporate AU cues into classifier training, allowing to train deep interpretable models.
Our new strategy is generic, and can be applied to any deep CNN- or transformer-based classifier without requiring any architectural change or significant additional training time.
arXiv Detail & Related papers (2024-10-01T10:42:55Z) - Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised
Semantic Segmentation [79.05949524349005]
We propose AuxSegNet+, a weakly supervised auxiliary learning framework to explore the rich information from saliency maps.
We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps.
arXiv Detail & Related papers (2024-03-02T10:03:21Z) - All-pairs Consistency Learning for Weakly Supervised Semantic
Segmentation [42.66269050864235]
We propose a new transformer-based regularization to better localize objects for Weakly supervised semantic segmentation (WSSS)
We adopt vision transformers as the self-attention mechanism naturally embeds pair-wise affinity.
Our method produces noticeably better class localization maps (67.3% mIoU on PASCAL VOC train)
arXiv Detail & Related papers (2023-08-08T15:14:23Z) - MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic
Segmentation [90.73815426893034]
We propose a transformer-based framework that aims to enhance weakly supervised semantic segmentation.
We introduce a Multi-Class Token transformer, which incorporates multiple class tokens to enable class-aware interactions with the patch tokens.
A Contrastive-Class-Token (CCT) module is proposed to enhance the learning of discriminative class tokens.
arXiv Detail & Related papers (2023-08-06T03:30:20Z) - Re-Attention Transformer for Weakly Supervised Object Localization [45.417606565085116]
We present a re-attention mechanism termed token refinement transformer (TRT) that captures the object-level semantics to guide the localization well.
Specifically, TRT introduces a novel module named token priority scoring module (TPSM) to suppress the effects of background noise while focusing on the target object.
arXiv Detail & Related papers (2022-08-03T04:34:28Z) - Saliency Guided Inter- and Intra-Class Relation Constraints for Weakly
Supervised Semantic Segmentation [66.87777732230884]
We propose a saliency guided Inter- and Intra-Class Relation Constrained (I$2$CRC) framework to assist the expansion of the activated object regions.
We also introduce an object guided label refinement module to take a full use of both the segmentation prediction and the initial labels for obtaining superior pseudo-labels.
arXiv Detail & Related papers (2022-06-20T03:40:56Z) - Multi-class Token Transformer for Weakly Supervised Semantic
Segmentation [94.78965643354285]
We propose a new transformer-based framework to learn class-specific object localization maps as pseudo labels for weakly supervised semantic segmentation (WSSS)
Inspired by the fact that the attended regions of the one-class token in the standard vision transformer can be leveraged to form a class-agnostic localization map, we investigate if the transformer model can also effectively capture class-specific attention for more discriminative object localization.
The proposed framework is shown to fully complement the Class Activation Mapping (CAM) method, leading to remarkably superior WSSS results on the PASCAL VOC and MS COCO datasets.
arXiv Detail & Related papers (2022-03-06T07:18:23Z) - TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised
Object Localization [112.46381729542658]
Weakly supervised object localization (WSOL) is a challenging problem when given image category labels.
We introduce the token semantic coupled attention map (TS-CAM) to take full advantage of the self-attention mechanism in visual transformer for long-range dependency extraction.
arXiv Detail & Related papers (2021-03-27T09:43:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.