Spatial Entropy Regularization for Vision Transformers
- URL: http://arxiv.org/abs/2206.04636v1
- Date: Thu, 9 Jun 2022 17:34:39 GMT
- Title: Spatial Entropy Regularization for Vision Transformers
- Authors: Elia Peruzzo, Enver Sangineto, Yahui Liu, Marco De Nadai, Wei Bi,
Bruno Lepri and Nicu Sebe
- Abstract summary: The attention maps of Vision Transformers (VTs), when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised.
We propose a VT regularization method based on a spatial formulation of the information entropy.
We show that the proposed regularization approach is beneficial with different training scenarios, datasets, downstream tasks and VT architectures.
- Score: 71.44392961125807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has shown that the attention maps of Vision Transformers (VTs),
when trained with self-supervision, can contain a semantic segmentation
structure which does not spontaneously emerge when training is supervised. In
this paper, we explicitly encourage the emergence of this spatial clustering as
a form of training regularization, this way including a self-supervised pretext
task into the standard supervised learning. In more detail, we propose a VT
regularization method based on a spatial formulation of the information
entropy. By minimizing the proposed spatial entropy, we explicitly ask the VT
to produce spatially ordered attention maps, this way including an object-based
prior during training. Using extensive experiments, we show that the proposed
regularization approach is beneficial with different training scenarios,
datasets, downstream tasks and VT architectures. The code will be available
upon acceptance.
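The abstract's core idea is to penalize spatially disordered attention maps via an entropy term. The paper's exact formulation is not given here, so the following is only a minimal sketch of the underlying intuition: treat a head's attention over the patch grid as a 2D probability distribution and compute its Shannon entropy, which is low when attention mass is concentrated (object-like) and high when it is spread uniformly. The function name `spatial_entropy` and the simplified formulation are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def spatial_entropy(attn_map, eps=1e-8):
    """Shannon entropy of an attention map treated as a 2D probability
    distribution over patch locations.

    attn_map: (H, W) array of non-negative attention weights for one
    head/query. Lower entropy means the attention mass is concentrated
    on few locations; a regularizer would minimize this quantity.
    """
    p = attn_map / (attn_map.sum() + eps)  # normalize to a distribution
    return -np.sum(p * np.log(p + eps))

# A peaked map (attention on one compact region) has lower entropy
# than a uniform map spread over all 14x14 patches.
peaked = np.zeros((14, 14))
peaked[3:5, 3:5] = 1.0
uniform = np.ones((14, 14))
assert spatial_entropy(peaked) < spatial_entropy(uniform)
```

In training, a term like this would be added to the supervised loss with a weighting coefficient; the paper's actual spatial formulation additionally accounts for the spatial arrangement of the attended patches, which this scalar entropy alone does not capture.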
Related papers
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Cross-Regularization [78.61621802973262]
We introduce an Orthogonal finetuning method for efficiently updating pretrained weights.
A cross-regularization strategy is also exploited to maintain stability in terms of zero-shot generalization.
We conduct extensive experiments to demonstrate that our method explicitly steers pretrained weight space to represent the task-specific knowledge.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
- How Transformers Learn Diverse Attention Correlations in Masked Vision Pretraining [66.08606211686339]
We provide the first end-to-end theoretical guarantee of learning one-layer transformers in masked reconstruction self-supervised pretraining.
On the conceptual side, we posit a mechanism of how transformers trained with masked vision pretraining objectives produce empirically observed local and diverse attention patterns.
On the technical side, our end-to-end characterization of training dynamics in softmax-attention models simultaneously accounts for input and position embeddings.
arXiv Detail & Related papers (2024-03-04T17:24:03Z)
- BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new map-based pre-training paradigm that is spatial-aware for use in vision-and-language navigation (VLN).
We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map.
Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning thereby facilitating the language-guided navigation goal.
arXiv Detail & Related papers (2022-12-08T16:27:54Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Transfer RL across Observation Feature Spaces via Model-Based Regularization [9.660642248872973]
In many reinforcement learning (RL) applications, the observation space is specified by human developers and restricted by physical realizations.
We propose a novel algorithm which extracts the latent-space dynamics in the source task, and transfers the dynamics model to the target task.
Our algorithm works for drastic changes of observation space without any inter-task mapping or any prior knowledge of the target task.
arXiv Detail & Related papers (2022-01-01T22:41:19Z)
- Temporal Predictive Coding For Model-Based Planning In Latent Space [80.99554006174093]
We present an information-theoretic approach that employs temporal predictive coding to encode elements in the environment that can be predicted across time.
We evaluate our model on a challenging modification of standard DMControl tasks where the background is replaced with natural videos that contain complex but irrelevant information to the planning task.
arXiv Detail & Related papers (2021-06-14T04:31:15Z)
- PEARL: Parallelized Expert-Assisted Reinforcement Learning for Scene Rearrangement Planning [28.9887381071402]
We propose a fine-grained action definition for Scene Rearrangement Planning (SRP) and introduce a large-scale scene rearrangement dataset.
We also propose a novel learning paradigm to efficiently train an agent through self-playing, without any prior knowledge.
arXiv Detail & Related papers (2021-05-10T03:27:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.