TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised
Object Localization
- URL: http://arxiv.org/abs/2103.14862v1
- Date: Sat, 27 Mar 2021 09:43:16 GMT
- Title: TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised
Object Localization
- Authors: Wei Gao, Fang Wan, Xingjia Pan, Zhiliang Peng, Qi Tian, Zhenjun Han,
Bolei Zhou, Qixiang Ye
- Abstract summary: Weakly supervised object localization (WSOL) is a challenging problem in which only image category labels are given.
We introduce the token semantic coupled attention map (TS-CAM) to take full advantage of the self-attention mechanism in visual transformer for long-range dependency extraction.
- Score: 112.46381729542658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised object localization (WSOL) is a challenging problem:
only image category labels are given, yet object localization models must be learned.
Optimizing a convolutional neural network (CNN) for classification tends to
activate local discriminative regions while ignoring complete object extent,
causing the partial activation issue. In this paper, we argue that partial
activation is caused by the intrinsic characteristics of CNN, where the
convolution operations produce local receptive fields and have difficulty
capturing long-range feature dependencies among pixels. We introduce the token
semantic coupled attention map (TS-CAM) to take full advantage of the
self-attention mechanism in visual transformer for long-range dependency
extraction. TS-CAM first splits an image into a sequence of patch tokens for
spatial embedding, which produce attention maps of long-range visual dependency
to avoid partial activation. TS-CAM then re-allocates category-related
semantics for patch tokens, enabling each of them to be aware of object
categories. TS-CAM finally couples the patch tokens with the semantic-agnostic
attention map to achieve semantic-aware localization. Experiments on the
ILSVRC/CUB-200-2011 datasets show that TS-CAM outperforms its CNN-CAM
counterparts by 7.1%/27.1% for WSOL, achieving state-of-the-art performance.
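
The coupling step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the semantic-agnostic map is the class-token attention averaged over layers and heads, that per-class semantic maps come from a linear head applied to the patch tokens, and that "coupling" means an element-wise product. The function name `ts_cam_sketch`, the array layouts, and the linear head `class_weights` are all hypothetical.

```python
import numpy as np


def ts_cam_sketch(cls_attn, patch_tokens, class_weights, grid_hw):
    """Couple a semantic-agnostic attention map with per-class semantic maps.

    cls_attn:      (L, H, N) attention from the class token to N patch tokens,
                   for L layers and H heads (hypothetical layout).
    patch_tokens:  (N, D) final-layer patch token embeddings.
    class_weights: (D, C) hypothetical linear head giving per-patch class scores.
    grid_hw:       (h, w) patch grid with h * w == N.
    Returns an array of shape (C, h, w), one localization map per class.
    """
    h, w = grid_hw
    # Semantic-agnostic attention: average over layers and heads (assumption).
    attn_map = cls_attn.mean(axis=(0, 1)).reshape(h, w)
    # Semantic-aware maps: one score per patch and class, reshaped to (C, h, w).
    semantic = (patch_tokens @ class_weights).T.reshape(-1, h, w)
    # Coupling assumed here: element-wise product, broadcast over classes.
    return attn_map[None, :, :] * semantic


# Usage with random stand-in data (no trained model involved).
rng = np.random.default_rng(0)
n_layers, n_heads, h, w, dim, n_classes = 12, 6, 14, 14, 384, 200
attn = rng.random((n_layers, n_heads, h * w))
tokens = rng.standard_normal((h * w, dim))
head = rng.standard_normal((dim, n_classes))
maps = ts_cam_sketch(attn, tokens, head, (h, w))
print(maps.shape)
```

In a WSOL pipeline, the map of the predicted class would typically be upsampled to the image size and thresholded to obtain a bounding box; that step is omitted here.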
Related papers
- Hierarchical Graph Interaction Transformer with Dynamic Token Clustering for Camouflaged Object Detection [57.883265488038134]
We propose a hierarchical graph interaction network termed HGINet for camouflaged object detection.
The network is capable of discovering imperceptible objects via effective graph interaction among the hierarchical tokenized features.
Our experiments demonstrate the superior performance of HGINet compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2024-08-27T12:53:25Z)
- Semantic-Constraint Matching Transformer for Weakly Supervised Object Localization [31.039698757869974]
Weakly supervised object localization (WSOL) strives to learn to localize objects with only image-level supervision.
Previous CNN-based methods suffer from partial activation issues, concentrating on the object's discriminative part instead of the entire entity scope.
We propose a novel Semantic-Constraint Matching Network (SCMN) via a transformer to converge on the divergent activation.
arXiv Detail & Related papers (2023-09-04T03:20:31Z)
- Spatial-Aware Token for Weakly Supervised Object Localization [137.0570026552845]
We propose a task-specific spatial-aware token to condition localization in a weakly supervised manner.
Experiments show that the proposed SAT achieves state-of-the-art performance on both CUB-200 and ImageNet, with 98.45% and 73.13% GT-known Loc.
arXiv Detail & Related papers (2023-03-18T15:38:17Z)
- DQnet: Cross-Model Detail Querying for Camouflaged Object Detection [54.82390534024954]
A convolutional neural network (CNN) for camouflaged object detection tends to activate local discriminative regions while ignoring complete object extent.
In this paper, we argue that partial activation is caused by the intrinsic characteristics of CNN.
In order to obtain feature maps that could activate full object extent, a novel framework termed Cross-Model Detail Querying network (DQnet) is proposed.
arXiv Detail & Related papers (2022-12-16T06:23:58Z)
- Attention-based Class Activation Diffusion for Weakly-Supervised Semantic Segmentation [98.306533433627]
Extracting class activation maps (CAM) is a key step for weakly-supervised semantic segmentation (WSSS).
This paper proposes a new method, dubbed AD-CAM, that couples the CAM and the Attention matrix in a probabilistic Diffusion manner.
Experiments show that AD-CAM as pseudo labels can yield stronger WSSS models than the state-of-the-art variants of CAM.
arXiv Detail & Related papers (2022-11-20T10:06:32Z)
- Saliency Guided Inter- and Intra-Class Relation Constraints for Weakly Supervised Semantic Segmentation [66.87777732230884]
We propose a saliency guided Inter- and Intra-Class Relation Constrained (I$^2$CRC) framework to assist the expansion of the activated object regions.
We also introduce an object guided label refinement module to make full use of both the segmentation prediction and the initial labels for obtaining superior pseudo-labels.
arXiv Detail & Related papers (2022-06-20T03:40:56Z)
- Anti-Adversarially Manipulated Attributions for Weakly Supervised Semantic Segmentation and Object Localization [31.69344455448125]
We present an attribution map of an image that is manipulated to increase the classification score produced by a classifier before the final softmax or sigmoid layer.
This manipulation is realized in an anti-adversarial manner, so that the original image is perturbed along pixel gradients in directions opposite to those used in an adversarial attack.
In addition, we introduce a new regularization procedure that inhibits the incorrect attribution of regions unrelated to the target object and the excessive concentration of attributions on a small region of the target object.
arXiv Detail & Related papers (2022-04-11T06:18:02Z)
- Contrastive learning of Class-agnostic Activation Map for Weakly Supervised Object Localization and Semantic Segmentation [32.76127086403596]
We propose Contrastive learning for Class-agnostic Activation Map (C$^2$AM) generation using unlabeled image data.
We form the positive and negative pairs based on the above relations and force the network to disentangle foreground and background.
As the network is guided to discriminate cross-image foreground-background, the class-agnostic activation maps learned by our approach generate more complete object regions.
arXiv Detail & Related papers (2022-03-25T08:46:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.