Learning Object Focused Attention
- URL: http://arxiv.org/abs/2504.08166v1
- Date: Thu, 10 Apr 2025 23:23:26 GMT
- Title: Learning Object Focused Attention
- Authors: Vivek Trivedy, Amani Almalki, Longin Jan Latecki
- Abstract summary: We propose an adaptation to the training of Vision Transformers (ViTs) that allows for an explicit modeling of objects during the attention computation. This is achieved by adding a new branch to selected attention layers that computes an auxiliary loss which we call the object-focused attention (OFA) loss. Our experimental results demonstrate that ViTs with OFA achieve better classification results than their base models, exhibit a stronger generalization ability, and learn representations based on object shapes rather than spurious correlations via general textures.
- Score: 5.340670496809963
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an adaptation to the training of Vision Transformers (ViTs) that allows for an explicit modeling of objects during the attention computation. This is achieved by adding a new branch to selected attention layers that computes an auxiliary loss which we call the object-focused attention (OFA) loss. We restrict the attention to image patches that belong to the same object class, which allows ViTs to gain a better understanding of configural (or holistic) object shapes by focusing on intra-object patches instead of other patches such as those in the background. Our proposed inductive bias fits easily into the attention framework of transformers since it only adds an auxiliary loss over selected attention layers. Furthermore, our approach has no additional overhead during inference. We also experiment with multiscale masking to further improve the performance of our OFA model and give a path forward for self-supervised learning with our method. Our experimental results demonstrate that ViTs with OFA achieve better classification results than their base models, exhibit a stronger generalization ability to out-of-distribution (OOD) and adversarially corrupted images, and learn representations based on object shapes rather than spurious correlations via general textures. For our OOD setting, we generate a novel dataset using the COCO dataset and Stable Diffusion inpainting which we plan to share with the community.
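The abstract describes the mechanism only at a high level, so the following is a minimal PyTorch sketch of what an OFA-style auxiliary term could look like: each query patch is penalized for placing attention mass outside the patches that share its object label. The function name `ofa_loss`, the per-patch label input, and the negative-log formulation are illustrative assumptions, not the authors' released implementation.

```python
import torch

def ofa_loss(attn, patch_labels, eps=1e-8):
    """Illustrative object-focused attention (OFA) auxiliary loss (assumed form).

    attn:         (B, H, N, N) softmaxed attention weights from a selected layer.
    patch_labels: (B, N) object-class id per patch, e.g. a segmentation mask
                  downsampled to the ViT patch grid.
    """
    # same_obj[b, i, j] = 1 if patches i and j carry the same object label
    same_obj = (patch_labels.unsqueeze(2) == patch_labels.unsqueeze(1)).float()  # (B, N, N)
    # attention mass each query patch places on patches of its own object
    intra_mass = (attn * same_obj.unsqueeze(1)).sum(dim=-1)                      # (B, H, N)
    # penalize attention that leaks onto background or other objects
    return -(intra_mass + eps).log().mean()
```

In such a setup the term would simply be added to the usual classification objective over the selected layers, e.g. `loss = ce_loss + lambda_ofa * ofa_loss(attn, patch_labels)`; nothing changes at inference time, consistent with the abstract's claim of no additional inference overhead.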
Related papers
- BOOTPLACE: Bootstrapped Object Placement with Detection Transformers [23.300369070771836]
We introduce BOOTPLACE, a novel paradigm that formulates object placement as a placement-by-detection problem.
Experimental results on established benchmarks demonstrate BOOTPLACE's superior performance in object repositioning.
arXiv Detail & Related papers (2025-03-27T21:21:20Z)
- Improving Image Clustering with Artifacts Attenuation via Inference-Time Attention Engineering [1.8786950286587742]
As model size increases, high-norm artifact anomalies appear in the patches of multi-head attention.
We propose Inference-Time Attention Engineering (ITAE) which manipulates attention function during inference.
ITAE shows improved clustering accuracy on multiple datasets by exhibiting more expressive features in latent space.
arXiv Detail & Related papers (2024-10-07T07:26:10Z)
- Localized Gaussians as Self-Attention Weights for Point Clouds Correspondence [92.07601770031236]
We investigate semantically meaningful patterns in the attention heads of an encoder-only Transformer architecture.
We find that fixing the attention weights not only accelerates the training process but also enhances the stability of the optimization.
arXiv Detail & Related papers (2024-09-20T07:41:47Z)
- Attention-Guided Masked Autoencoders For Learning Image Representations [16.257915216763692]
Masked autoencoders (MAEs) have established themselves as a powerful method for unsupervised pre-training for computer vision tasks.
We propose to inform the reconstruction process through an attention-guided loss function.
Our evaluations show that our pre-trained models learn better latent representations than the vanilla MAE.
arXiv Detail & Related papers (2024-02-23T08:11:25Z)
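For the attention-guided masked autoencoder entry above, the summary only says that the reconstruction is informed by an attention-guided loss; a minimal sketch of one plausible reading is given below, where the per-patch reconstruction error is re-weighted by attention-derived relevance. The guidance source `cls_attn` (e.g. [CLS] attention from a frozen reference ViT) and the exact weighting are assumptions, not that paper's published formulation.

```python
import torch

def attention_weighted_recon_loss(pred, target, cls_attn, masked_idx, eps=1e-8):
    """Per-patch MSE reconstruction loss re-weighted by attention (assumed form).

    pred, target: (B, N, D) predicted and ground-truth patch values.
    cls_attn:     (B, N) attention of the [CLS] token over patches.
    masked_idx:   (B, M) long indices of the masked patches being reconstructed.
    """
    per_patch = ((pred - target) ** 2).mean(dim=-1)                   # (B, N) plain MSE per patch
    weights = cls_attn / (cls_attn.sum(dim=-1, keepdim=True) + eps)   # normalize weights to sum to 1
    per_patch = torch.gather(per_patch, 1, masked_idx)                # keep only masked patches
    weights = torch.gather(weights, 1, masked_idx)
    return (weights * per_patch).sum(dim=-1).mean()                   # weighted sum, mean over batch
```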
- Explicitly Disentangled Representations in Object-Centric Learning [0.0]
We propose a novel architecture that biases object-centric models toward disentangling shape and texture components.
arXiv Detail & Related papers (2024-01-18T17:22:11Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [55.12082817901671]
We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT).
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet, compared to baselines and competitors under the same model setting.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Patch-level Representation Learning for Self-supervised Vision Transformers [68.8862419248863]
Vision Transformers (ViTs) have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks.
Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations.
We demonstrate that SelfPatch can significantly improve the performance of existing SSL methods for various visual tasks.
arXiv Detail & Related papers (2022-06-16T08:01:19Z)
- Object-wise Masked Autoencoders for Fast Pre-training [13.757095663704858]
We show that current masked image encoding models learn the underlying relationship between all objects in the whole scene, instead of a single object representation.
We introduce a novel object selection and division strategy that drops non-object patches, learning object-wise representations by selective reconstruction with region-of-interest masks.
Experiments on four commonly-used datasets demonstrate the effectiveness of our model in reducing the compute cost by 72% while achieving competitive performance.
arXiv Detail & Related papers (2022-05-28T05:13:45Z)
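The object-wise masked autoencoder entry above drops non-object patches before encoding; below is a rough sketch of how such a selection step could look on a ViT patch grid. The pooling of the object mask onto the patch grid and the 0.5 keep threshold are illustrative assumptions, not that paper's exact strategy.

```python
import torch
import torch.nn.functional as F

def select_object_patches(patch_tokens, obj_mask, grid_size, keep_thresh=0.5):
    """Keep only the patch tokens that overlap an object region (assumed rule).

    patch_tokens: (B, N, D) embedded patches, with N = grid_size * grid_size.
    obj_mask:     (B, 1, H, W) binary object mask (box or segmentation).
    Returns a list of (N_b, D) tensors, since kept counts differ per image.
    """
    # fraction of object pixels covered by each patch cell
    cover = F.adaptive_avg_pool2d(obj_mask.float(), grid_size).flatten(1)  # (B, N)
    keep = cover >= keep_thresh                                            # (B, N) boolean keep mask
    return [tokens[k] for tokens, k in zip(patch_tokens, keep)]
```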
- Object-aware Contrastive Learning for Debiased Scene Representation [74.30741492814327]
We develop a novel object-aware contrastive learning framework that localizes objects in a self-supervised manner.
We also introduce two data augmentations based on ContraCAM, object-aware random crop and background mixup, which reduce contextual and background biases during contrastive self-supervised learning.
arXiv Detail & Related papers (2021-07-30T19:24:07Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Attention map visualization of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- Instance Localization for Self-supervised Detection Pretraining [68.24102560821623]
We propose a new self-supervised pretext task, called instance localization.
We show that integration of bounding boxes into pretraining promotes better task alignment and architecture alignment for transfer learning.
Experimental results demonstrate that our approach yields state-of-the-art transfer learning results for object detection.
arXiv Detail & Related papers (2021-02-16T17:58:57Z)
- Improving Few-shot Learning by Spatially-aware Matching and CrossTransformer [116.46533207849619]
We study the impact of scale and location mismatch in the few-shot learning scenario.
We propose a novel Spatially-aware Matching scheme to effectively perform matching across multiple scales and locations.
arXiv Detail & Related papers (2020-01-06T14:10:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.