Spatial Entropy Regularization for Vision Transformers
- URL: http://arxiv.org/abs/2206.04636v1
- Date: Thu, 9 Jun 2022 17:34:39 GMT
- Title: Spatial Entropy Regularization for Vision Transformers
- Authors: Elia Peruzzo, Enver Sangineto, Yahui Liu, Marco De Nadai, Wei Bi,
Bruno Lepri and Nicu Sebe
- Abstract summary: The attention maps of Vision Transformers (VTs), when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised.
We propose a VT regularization method based on a spatial formulation of the information entropy.
We show that the proposed regularization approach is beneficial with different training scenarios, datasets, downstream tasks and VT architectures.
- Score: 71.44392961125807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has shown that the attention maps of Vision Transformers (VTs),
when trained with self-supervision, can contain a semantic segmentation
structure which does not spontaneously emerge when training is supervised. In
this paper, we explicitly encourage the emergence of this spatial clustering as
a form of training regularization, thereby including a self-supervised pretext
task in the standard supervised learning. In more detail, we propose a VT
regularization method based on a spatial formulation of the information
entropy. By minimizing the proposed spatial entropy, we explicitly ask the VT
to produce spatially ordered attention maps, thereby including an object-based
prior during training. With extensive experiments, we show that the proposed
regularization approach is beneficial with different training scenarios,
datasets, downstream tasks and VT architectures. The code will be available
upon acceptance.
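The abstract does not specify the exact spatial formulation of the entropy, so the following is only a minimal sketch of the general idea: add an entropy penalty on the [CLS]-to-patch attention to the standard supervised loss so that spatially disordered attention maps are discouraged. It uses the plain Shannon entropy of the attention distribution as a simplified stand-in for the paper's spatial entropy; the names `spatial_entropy_loss`, `cls_attention`, `lambda_se` and the model call in the usage comment are hypothetical.

```python
# Simplified sketch of an entropy regularizer on ViT attention maps.
# NOTE: this is NOT the paper's exact spatial entropy (which is defined on a
# spatial formulation of the attention maps); it only illustrates how such a
# term could be added to a supervised objective.

import torch
import torch.nn.functional as F


def spatial_entropy_loss(cls_attention: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Shannon entropy of the [CLS]-to-patch attention, averaged over batch and heads.

    cls_attention: tensor of shape (batch, heads, num_patches) holding the
    attention weights of the [CLS] token over the patch tokens.
    """
    # Renormalize to a probability distribution over patches.
    p = cls_attention / (cls_attention.sum(dim=-1, keepdim=True) + eps)
    # Per-head Shannon entropy; low entropy = attention concentrated on few patches.
    entropy = -(p * (p + eps).log()).sum(dim=-1)
    return entropy.mean()


# Hypothetical usage inside a supervised training step:
# logits, cls_attention = model(images, return_cls_attention=True)
# loss = F.cross_entropy(logits, labels) + lambda_se * spatial_entropy_loss(cls_attention)
# loss.backward()
```

Minimizing this term pushes each head to concentrate its attention on a compact set of patches, which is one way to read the "spatially ordered attention maps" objective described in the abstract; the actual method should be taken from the paper itself.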
Related papers
- Locality Alignment Improves Vision-Language Models [55.275235524659905]
Vision language models (VLMs) have seen growing adoption in recent years, but many still struggle with basic spatial reasoning errors.
We propose a new efficient post-training stage for ViTs called locality alignment.
We show that locality-aligned backbones improve performance across a range of benchmarks.
arXiv Detail & Related papers (2024-10-14T21:01:01Z) - Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further exploited to maintain the stability in terms of zero-shot generalization of VLMs, dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the model in the few-shot image classification scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z) - BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new spatially aware, map-based pre-training paradigm for vision-and-language navigation (VLN).
We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map.
Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatially aware cross-modal reasoning, thereby facilitating the language-guided navigation goal.
arXiv Detail & Related papers (2022-12-08T16:27:54Z) - Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z) - Vision Transformers provably learn spatial structure [34.61885883486938]
Vision Transformers (ViTs) have achieved performance comparable or superior to Convolutional Neural Networks (CNNs) in computer vision.
Yet, recent works have shown that while minimizing their training loss, ViTs specifically learn spatially localized patterns.
arXiv Detail & Related papers (2022-10-13T19:53:56Z) - Transfer RL across Observation Feature Spaces via Model-Based
Regularization [9.660642248872973]
In many reinforcement learning (RL) applications, the observation space is specified by human developers and restricted by physical realizations.
We propose a novel algorithm which extracts the latent-space dynamics in the source task, and transfers the dynamics model to the target task.
Our algorithm works for drastic changes of observation space without any inter-task mapping or any prior knowledge of the target task.
arXiv Detail & Related papers (2022-01-01T22:41:19Z) - Temporal Predictive Coding For Model-Based Planning In Latent Space [80.99554006174093]
We present an information-theoretic approach that employs temporal predictive coding to encode elements in the environment that can be predicted across time.
We evaluate our model on a challenging modification of standard DMControl tasks where the background is replaced with natural videos that contain complex but irrelevant information to the planning task.
arXiv Detail & Related papers (2021-06-14T04:31:15Z) - PEARL: Parallelized Expert-Assisted Reinforcement Learning for Scene
Rearrangement Planning [28.9887381071402]
We propose a fine-grained action definition for Scene Rearrangement Planning (SRP) and introduce a large-scale scene rearrangement dataset.
We also propose a novel learning paradigm to efficiently train an agent through self-playing, without any prior knowledge.
arXiv Detail & Related papers (2021-05-10T03:27:16Z)