Learning with Difference Attention for Visually Grounded Self-supervised Representations
- URL: http://arxiv.org/abs/2306.14603v1
- Date: Mon, 26 Jun 2023 11:27:55 GMT
- Title: Learning with Difference Attention for Visually Grounded Self-supervised Representations
- Authors: Aishwarya Agarwal and Srikrishna Karanam and Balaji Vasan Srinivasan
- Abstract summary: We propose visual difference attention (VDA) to compute visual attention maps in an unsupervised fashion.
We use VDA to derive attention maps for state-of-the-art SSL methods and show these maps do not highlight all salient regions in an image accurately, suggesting these methods' inability to learn strong representations for downstream tasks like segmentation.
Motivated by these limitations, we propose a new learning objective, the Differentiable Difference Attention (DiDA) loss, which leads to substantial improvements in an SSL model's visual grounding to an image's salient regions.
- Score: 18.743052370916192
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works in self-supervised learning have shown impressive results on
single-object images, but they struggle to perform well on complex multi-object
images as evidenced by their poor visual grounding. To demonstrate this
concretely, we propose visual difference attention (VDA) to compute visual
attention maps in an unsupervised fashion by comparing an image with its
salient-regions-masked-out version. We use VDA to derive attention maps for
state-of-the-art SSL methods and show they do not highlight all salient regions
in an image accurately, suggesting their inability to learn strong
representations for downstream tasks like segmentation. Motivated by these
limitations, we cast VDA as a differentiable operation and propose a new
learning objective, Differentiable Difference Attention (DiDA) loss, which
leads to substantial improvements in an SSL model's visual grounding to an
image's salient regions.
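The VDA/DiDA idea above (attention as the difference between an image's features and those of its salient-regions-masked-out version, then made differentiable and used as a loss) can be sketched as follows. This is a toy illustration only, not the paper's implementation: the `encode` stand-in, the grid shapes, and the L1 penalty are all assumptions; the actual method uses a full SSL backbone.

```python
# Toy sketch of Visual Difference Attention (VDA) and a DiDA-style loss.
# `encode` is a hypothetical stand-in for a per-pixel feature extractor.

def encode(image):
    """Stand-in 'encoder': per-pixel feature = squared intensity.
    A real implementation would use a pretrained SSL backbone."""
    return [[px * px for px in row] for row in image]

def vda_map(image, mask):
    """VDA: compare features of the image with those of its
    salient-regions-masked-out version; large per-pixel differences
    mark regions the representation attends to."""
    masked = [[0.0 if m else px for px, m in zip(r_img, r_msk)]
              for r_img, r_msk in zip(image, mask)]
    f_full, f_masked = encode(image), encode(masked)
    return [[abs(a - b) for a, b in zip(ra, rb)]
            for ra, rb in zip(f_full, f_masked)]

def dida_loss(image, saliency_mask):
    """DiDA sketch: since vda_map is differentiable in the features,
    penalize mismatch between the (normalized) VDA map and the known
    salient region, pushing the encoder's attention there."""
    attn = vda_map(image, saliency_mask)
    peak = max(max(row) for row in attn) or 1.0  # avoid divide-by-zero
    return sum(abs(a / peak - float(m))
               for ra, rm in zip(attn, saliency_mask)
               for a, m in zip(ra, rm)) / (len(attn) * len(attn[0]))

img = [[1.0, 0.0], [0.0, 0.5]]
sal = [[1, 0], [0, 0]]            # top-left pixel is 'salient'
print(vda_map(img, sal))          # [[1.0, 0.0], [0.0, 0.0]]
print(dida_loss(img, sal))        # 0.0 (attention already grounded)
```

When masking the salient region changes the features a lot, the VDA map lights up there and the loss is small; a map that misses the salient region is penalized.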
Related papers
- CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
arXiv Detail & Related papers (2024-02-29T15:05:11Z)
- Attention-Guided Masked Autoencoders For Learning Image Representations [16.257915216763692]
Masked autoencoders (MAEs) have established themselves as a powerful method for unsupervised pre-training for computer vision tasks.
We propose to inform the reconstruction process through an attention-guided loss function.
Our evaluations show that our pre-trained models learn better latent representations than the vanilla MAE.
arXiv Detail & Related papers (2024-02-23T08:11:25Z)
- C-SAW: Self-Supervised Prompt Learning for Image Generalization in Remote Sensing [12.930814370829893]
We focus on domain and class generalization problems in analyzing optical remote sensing images, using the large-scale pre-trained vision-language model (VLM), CLIP.
Existing prompt learning techniques overlook the importance of incorporating domain and content information into the prompts.
We propose a solution that ensures domain-invariant prompt learning while enhancing the expressiveness of visual features.
arXiv Detail & Related papers (2023-11-27T13:35:20Z)
- StyleEDL: Style-Guided High-order Attention Network for Image Emotion Distribution Learning [69.06749934902464]
We propose a style-guided high-order attention network for image emotion distribution learning termed StyleEDL.
StyleEDL interactively learns stylistic-aware representations of images by exploring the hierarchical stylistic information of visual contents.
In addition, we introduce a stylistic graph convolutional network to dynamically generate the content-dependent emotion representations.
arXiv Detail & Related papers (2023-08-06T03:22:46Z)
- Saliency Guided Contrastive Learning on Scene Images [71.07412958621052]
We leverage the saliency map derived from the model's output during learning to highlight discriminative regions and guide the whole contrastive learning.
Our method significantly improves the performance of self-supervised learning on scene images, with gains of +1.1 Top-1 accuracy in ImageNet linear evaluation, and +4.3 and +2.2 in semi-supervised learning with 1% and 10% of ImageNet labels, respectively.
arXiv Detail & Related papers (2023-02-22T15:54:07Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- Patch-level Representation Learning for Self-supervised Vision Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)
- Unsupervised Learning of Dense Visual Representations [14.329781842154281]
We propose View-Agnostic Dense Representation (VADeR) for unsupervised learning of dense representations.
VADeR learns pixelwise representations by forcing local features to remain constant over different viewing conditions.
Our method outperforms ImageNet supervised pretraining in multiple dense prediction tasks.
arXiv Detail & Related papers (2020-11-11T01:28:11Z)
- Unsupervised Deep Metric Learning with Transformed Attention Consistency and Contrastive Clustering Loss [28.17607283348278]
Existing approaches for unsupervised metric learning focus on exploring self-supervision information within the input image itself.
We observe that, when analyzing images, human eyes often compare images against each other instead of examining images individually.
We develop a new approach to unsupervised deep metric learning where the network is learned based on self-supervision information across images.
arXiv Detail & Related papers (2020-08-10T19:33:47Z)
- Distilling Localization for Self-Supervised Representation Learning [82.79808902674282]
Contrastive learning has revolutionized unsupervised representation learning.
Current contrastive models are ineffective at localizing the foreground object.
We propose a data-driven approach for learning invariance to backgrounds.
arXiv Detail & Related papers (2020-04-14T16:29:42Z)
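Several entries above (VADeR in particular) describe dense SSL objectives that force features at corresponding pixels to remain constant across views. A minimal, hypothetical sketch of such a pixelwise consistency target: the 'views' and per-pixel 'features' below are toy stand-ins (a horizontal shift and raw intensities), chosen so correspondence is exact by construction, which is precisely the state a real learned encoder is trained toward.

```python
# Toy sketch of a pixelwise consistency objective in the spirit of VADeR:
# features at corresponding locations of two views of an image should agree.

def view_features(image, shift):
    """Toy per-pixel 'feature': intensity of a horizontally shifted view.
    Position (i, j) of this view shows source pixel (i, (j - shift) % w)."""
    h, w = len(image), len(image[0])
    return [[image[i][(j - shift) % w] for j in range(w)] for i in range(h)]

def pixelwise_consistency_loss(image, shift):
    """Mean squared difference between features at *corresponding*
    pixels of two views; a perfectly view-agnostic representation
    drives this to zero."""
    f1 = view_features(image, 0)       # identity view
    f2 = view_features(image, shift)   # shifted view
    h, w = len(image), len(image[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            # Pixel (i, j) in view 1 corresponds to (i, (j + shift) % w)
            # in view 2, i.e. the same source pixel after the shift.
            total += (f1[i][j] - f2[i][(j + shift) % w]) ** 2
    return total / (h * w)
```

With learned features in place of raw intensities, minimizing this quantity is what "forcing local features to remain constant over different viewing conditions" amounts to.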
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.