Learning Self-Regularized Adversarial Views for Self-Supervised Vision
Transformers
- URL: http://arxiv.org/abs/2210.08458v1
- Date: Sun, 16 Oct 2022 06:20:44 GMT
- Title: Learning Self-Regularized Adversarial Views for Self-Supervised Vision
Transformers
- Authors: Tao Tang, Changlin Li, Guangrun Wang, Kaicheng Yu, Xiaojun Chang,
Xiaodan Liang
- Abstract summary: We propose a self-regularized AutoAugment method to learn views for self-supervised vision transformers.
First, we reduce the search cost of AutoView to nearly zero by learning views and network parameters simultaneously.
We also present a curated augmentation policy search space for self-supervised learning.
- Score: 105.89564687747134
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic data augmentation (AutoAugment) strategies are indispensable in
supervised data-efficient training protocols of vision transformers, and have
led to state-of-the-art results in supervised learning. Despite the success,
its development and application on self-supervised vision transformers have
been hindered by several barriers, including the high search cost, the lack of
supervision, and the unsuitable search space. In this work, we propose
AutoView, a self-regularized adversarial AutoAugment method, to learn views for
self-supervised vision transformers, by addressing the above barriers. First,
we reduce the search cost of AutoView to nearly zero by learning views and
network parameters simultaneously in a single forward-backward step, minimizing
and maximizing the mutual information among different augmented views,
respectively. Then, to avoid information collapse caused by the lack of label
supervision, we propose a self-regularized loss term to guarantee the
information propagation. Additionally, we present a curated augmentation policy
search space for self-supervised learning, by modifying the generally used
search space designed for supervised learning. On ImageNet, our AutoView
achieves a remarkable improvement over the RandAug baseline (+10.2% k-NN accuracy)
and consistently outperforms the state-of-the-art manually tuned view policy by a
clear margin (up to +1.3% k-NN accuracy). Extensive experiments show that AutoView
pretraining also benefits downstream tasks (+1.2% mAcc on ADE20K Semantic
Segmentation and +2.8% mAP on revisited Oxford Image Retrieval benchmark) and
improves model robustness (+2.3% Top-1 Acc on ImageNet-A and +1.0% AUPR on
ImageNet-O). Code and models will be available at
https://github.com/Trent-tangtao/AutoView.
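To make the training scheme above concrete, the following is a minimal PyTorch-style sketch of the adversarial view-learning step: the network is updated to maximize agreement between the two augmented views (an InfoNCE lower bound on their mutual information), while the view parameters are updated in the opposite direction, softened by a self-regularization term that keeps each view informative about the original image. The helper names (`apply_views`, `view_params`), the cosine-similarity form of the regularizer, and the weight `lambda_reg` are illustrative assumptions, not the released AutoView implementation.
```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.2):
    """InfoNCE loss: a (negative) lower bound on the mutual information
    between two batches of view embeddings. Low loss = high agreement."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature               # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

def autoview_step(encoder, apply_views, images, view_params,
                  opt_net, opt_view, lambda_reg=0.1):
    """One illustrative training step. `apply_views` is assumed to be a
    differentiable augmentation module driven by `view_params` (a list of
    learnable policy tensors); neither is specified in the abstract."""
    v1, v2 = apply_views(images, view_params)        # two learned views
    z1, z2 = encoder(v1), encoder(v2)
    loss_mi = info_nce(z1, z2)

    # Self-regularization (assumed form): keep each view's embedding close to
    # the clean image's embedding so the adversarial views cannot collapse
    # into uninformative inputs.
    with torch.no_grad():
        z0 = encoder(images)
    loss_reg = (1 - F.cosine_similarity(z1, z0, dim=-1)).mean() \
             + (1 - F.cosine_similarity(z2, z0, dim=-1)).mean()

    opt_net.zero_grad()
    opt_view.zero_grad()
    # Network parameters descend the contrastive loss (maximize view agreement).
    loss_mi.backward(retain_graph=True, inputs=list(encoder.parameters()))
    # View parameters ascend it (harder, lower-MI views) while descending the
    # self-regularizer that preserves information propagation.
    (-loss_mi + lambda_reg * loss_reg).backward(inputs=list(view_params))
    opt_net.step()
    opt_view.step()
    return loss_mi.detach(), loss_reg.detach()
```
The sketch separates the two gradient computations only for readability; per the abstract, the paper folds the view and network updates into a single forward-backward step.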
Related papers
- Cross-Camera Distracted Driver Classification through Feature Disentanglement and Contrastive Learning [13.613407983544427]
We introduce a robust model designed to withstand changes in camera position within the vehicle.
Our Driver Behavior Monitoring Network (DBMNet) relies on a lightweight backbone and integrates a disentanglement module.
Experiments conducted on the daytime and nighttime subsets of the 100-Driver dataset validate the effectiveness of our approach.
arXiv Detail & Related papers (2024-11-20T10:27:12Z)
- Federated Self-Supervised Learning of Monocular Depth Estimators for Autonomous Vehicles [0.0]
FedSCDepth is a novel method that combines federated learning and deep self-supervision to enable the learning of monocular depth estimators.
Our proposed method achieves near state-of-the-art performance, with a test loss below 0.13 and requiring, on average, only 1.5k training steps.
arXiv Detail & Related papers (2023-10-07T14:54:02Z)
- A Novel Driver Distraction Behavior Detection Method Based on Self-supervised Learning with Masked Image Modeling [5.1680226874942985]
Driver distraction causes a significant number of traffic accidents every year, resulting in economic losses and casualties.
Driver distraction detection primarily relies on traditional convolutional neural networks (CNN) and supervised learning methods.
This paper proposes a new self-supervised learning method based on masked image modeling for driver distraction behavior detection.
arXiv Detail & Related papers (2023-06-01T10:53:32Z)
- M$^2$DAR: Multi-View Multi-Scale Driver Action Recognition with Vision Transformer [5.082919518353888]
We present a multi-view, multi-scale framework for naturalistic driving action recognition and localization in untrimmed videos.
Our system features a weight-sharing, multi-scale Transformer-based action recognition network that learns robust hierarchical representations.
arXiv Detail & Related papers (2023-05-13T02:38:15Z)
- Self-Supervised Representation Learning from Temporal Ordering of Automated Driving Sequences [49.91741677556553]
We propose TempO, a temporal ordering pretext task for pre-training region-level feature representations for perception tasks.
We embed each frame by an unordered set of proposal feature vectors, a representation that is natural for object detection or tracking systems.
Extensive evaluations on the BDD100K, nuImages, and MOT17 datasets show that our TempO pre-training approach outperforms single-frame self-supervised learning methods.
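As a rough illustration of the temporal-ordering pretext task described in this entry (not the TempO formulation itself), one can pool each frame's unordered proposal features into a frame embedding and train a small head to predict which of two frames comes first; the mean pooling and head architecture below are assumptions.
```python
import torch
import torch.nn as nn

class PairwiseOrderingHead(nn.Module):
    """Toy temporal-ordering head: pool a frame's unordered proposal features
    into a single embedding, then classify which of two frames comes first."""
    def __init__(self, dim=256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, proposals_a, proposals_b):
        # proposals_*: (B, N, dim) proposal feature vectors for one frame each
        fa = proposals_a.mean(dim=1)                 # permutation-invariant pooling
        fb = proposals_b.mean(dim=1)
        return self.scorer(torch.cat([fa, fb], dim=-1)).squeeze(-1)

# Self-supervised label: 1 if frame A precedes frame B in the clip, else 0.
# loss = F.binary_cross_entropy_with_logits(head(pa, pb), order_labels.float())
```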
arXiv Detail & Related papers (2023-02-17T18:18:27Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
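A rough sketch of the token-reduction idea summarized above, under the assumption that each token accumulates a halting score and is dropped from later layers once that score crosses a threshold; this is illustrative, not the AdaViT code.
```python
import torch
import torch.nn as nn

class TokenHaltingLayer(nn.Module):
    """Toy per-layer token halting: every active token emits a halting score;
    once its cumulative score passes a threshold, the token is masked out of
    subsequent layers, so fewer tokens are processed as inference proceeds."""
    def __init__(self, dim, threshold=1.0):
        super().__init__()
        self.halt = nn.Linear(dim, 1)
        self.threshold = threshold

    def forward(self, tokens, cum_score, active_mask):
        # tokens: (B, N, D); cum_score, active_mask: (B, N)
        score = torch.sigmoid(self.halt(tokens)).squeeze(-1)
        cum_score = cum_score + score * active_mask
        active_mask = active_mask * (cum_score < self.threshold).float()
        # Zero out halted tokens; a full model would also mask them in attention.
        return tokens * active_mask.unsqueeze(-1), cum_score, active_mask
```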
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- Efficient Self-supervised Vision Transformers for Representation Learning [86.57557009109411]
We show that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity.
We propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies.
Our results show that, combining the two techniques, EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation.
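The region-matching pre-training task can be sketched as follows, assuming a student/teacher pair of region (patch) tokens from two augmented views; the nearest-neighbour matching rule and cosine objective are illustrative stand-ins, not EsViT's actual loss.
```python
import torch
import torch.nn.functional as F

def region_matching_loss(student_regions, teacher_regions):
    """Illustrative region-matching objective: for each student region token
    from one view, find its most similar teacher region token from the other
    view and pull the matched pair together. Shapes: (B, R, D)."""
    s = F.normalize(student_regions, dim=-1)
    t = F.normalize(teacher_regions, dim=-1)
    sim = torch.einsum("brd,bkd->brk", s, t)         # cross-view similarities
    match = sim.argmax(dim=-1)                       # best teacher index per student region
    matched_t = torch.gather(
        t, 1, match.unsqueeze(-1).expand(-1, -1, t.size(-1)))
    # Maximize cosine similarity of matched pairs; the teacher side is detached.
    return 1.0 - (s * matched_t.detach()).sum(dim=-1).mean()
```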
arXiv Detail & Related papers (2021-06-17T19:57:33Z)
- Fine-Grained Vehicle Perception via 3D Part-Guided Visual Data Augmentation [77.60050239225086]
We propose an effective training data generation process by fitting a 3D car model with dynamic parts to vehicles in real images.
Our approach is fully automatic without any human interaction.
We present a multi-task network for VUS parsing and a multi-stream network for VHI parsing.
arXiv Detail & Related papers (2020-12-15T03:03:38Z)
- Auto-Rectify Network for Unsupervised Indoor Depth Estimation [119.82412041164372]
We establish that the complex ego-motions exhibited in handheld settings are a critical obstacle for learning depth.
We propose a data pre-processing method that rectifies training images by removing their relative rotations for effective learning.
Our results outperform the previous unsupervised SOTA method by a large margin on the challenging NYUv2 dataset.
arXiv Detail & Related papers (2020-06-04T08:59:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.