LaCViT: A Label-aware Contrastive Fine-tuning Framework for Vision
Transformers
- URL: http://arxiv.org/abs/2303.18013v3
- Date: Mon, 5 Feb 2024 22:46:06 GMT
- Title: LaCViT: A Label-aware Contrastive Fine-tuning Framework for Vision
Transformers
- Authors: Zijun Long, Zaiqiao Meng, Gerardo Aragon Camarasa, Richard McCreadie
- Abstract summary: Vision Transformers (ViTs) have emerged as popular models in computer vision, demonstrating state-of-the-art performance across various tasks.
We introduce a novel Label-aware Contrastive Training framework, LaCViT, which significantly enhances the quality of embeddings in ViTs.
LaCViT statistically significantly enhances the performance of three evaluated ViTs by up to 10.78% in Top-1 accuracy.
- Score: 18.76039338977432
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision Transformers (ViTs) have emerged as popular models in computer vision,
demonstrating state-of-the-art performance across various tasks. This success
typically follows a two-stage strategy involving pre-training on large-scale
datasets using self-supervised signals, such as masked random patches, followed
by fine-tuning on task-specific labeled datasets with cross-entropy loss.
However, this reliance on cross-entropy loss has been identified as a limiting
factor in ViTs, affecting their generalization and transferability to
downstream tasks. Addressing this critical challenge, we introduce a novel
Label-aware Contrastive Training framework, LaCViT, which significantly
enhances the quality of embeddings in ViTs. LaCViT not only addresses the
limitations of cross-entropy loss but also facilitates more effective transfer
learning across diverse image classification tasks. Our comprehensive
experiments on eight standard image classification datasets reveal that LaCViT
statistically significantly enhances the performance of three evaluated ViTs by
up to 10.78% in Top-1 accuracy.
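
For context on what a label-aware contrastive objective looks like, the sketch below implements a generic supervised contrastive (SupCon-style) loss of the kind LaCViT builds on: embeddings of samples that share a label are pulled together, while all other samples in the batch act as negatives. The ViT backbone, projection head, and temperature value named here are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def label_aware_contrastive_loss(embeddings: torch.Tensor,
                                 labels: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    """Generic supervised (label-aware) contrastive loss.

    Samples sharing a label are treated as positives for each other; every
    other sample in the batch acts as a negative. This is a SupCon-style
    sketch of the kind of objective LaCViT builds on, not the paper's
    exact loss.
    """
    z = F.normalize(embeddings, dim=1)            # (B, D) unit-norm embeddings
    sim = z @ z.t() / temperature                 # (B, B) scaled cosine similarities
    batch_size = z.size(0)

    # Exclude each anchor's similarity to itself from the softmax denominator.
    self_mask = torch.eye(batch_size, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))

    # Positives: samples with the same label as the anchor, excluding the anchor.
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask

    # Log-probability of each candidate under a softmax over the anchor's row.
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Average the log-probability over each anchor's positives.
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss_per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count

    # Anchors with no positive in the batch are skipped.
    has_pos = pos_mask.any(dim=1)
    return loss_per_anchor[has_pos].mean()

# Hypothetical fine-tuning step: replace the usual cross-entropy loss with the
# contrastive loss above, applied to projected ViT embeddings.
# feats = vit_backbone(images)          # e.g. the [CLS] embedding of a pre-trained ViT
# loss = label_aware_contrastive_loss(projection_head(feats), labels)
# loss.backward(); optimizer.step()
```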
Related papers
- The Sword of Damocles in ViTs: Computational Redundancy Amplifies Adversarial Transferability [38.32538271219404]
We investigate the role of computational redundancy in Vision Transformers (ViTs) and its impact on adversarial transferability.
We identify two forms of redundancy, at the data level and the model level, that can be harnessed to amplify attack effectiveness.
Building on this insight, we design a suite of techniques, including attention sparsity manipulation, attention head permutation, clean token regularization, ghost MoE diversification, and test-time adversarial training.
arXiv Detail & Related papers (2025-04-15T01:59:47Z)
- Hierarchical Side-Tuning for Vision Transformers [33.536948382414316]
Fine-tuning pre-trained Vision Transformers (ViTs) has shown significant promise in enhancing visual recognition tasks.
Parameter-efficient transfer learning (PETL) has shown potential for achieving high performance with fewer parameter updates than full fine-tuning.
This paper introduces Hierarchical Side-Tuning (HST), an innovative PETL method facilitating the transfer of ViT models to diverse downstream tasks.
arXiv Detail & Related papers (2023-10-09T04:16:35Z)
- Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization [32.908816911260615]
Vision transformers (ViTs) have been successfully deployed in a variety of computer vision tasks, but they are still vulnerable to adversarial samples.
Transfer-based attacks use a local model to generate adversarial samples and directly transfer them to attack a target black-box model.
We propose the Token Gradient Regularization (TGR) method to overcome the shortcomings of existing approaches.
arXiv Detail & Related papers (2023-03-28T06:23:17Z)
- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves much better performance than prior art.
arXiv Detail & Related papers (2022-10-13T04:00:29Z)
- Deeper Insights into ViTs Robustness towards Common Corruptions [82.79764218627558]
We investigate how CNN-like architectural designs and CNN-based data augmentation strategies affect ViTs' robustness to common corruptions.
We demonstrate that overlapping patch embedding and a convolutional Feed-Forward Network (FFN) boost robustness.
We also introduce a novel conditional method enabling input-varied augmentations from two angles.
arXiv Detail & Related papers (2022-04-26T08:22:34Z)
- Efficient Self-supervised Vision Transformers for Representation Learning [86.57557009109411]
We show that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity.
We propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies.
Our results show that, by combining the two techniques, EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation.
arXiv Detail & Related papers (2021-06-17T19:57:33Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Learning Invariant Representations across Domains and Tasks [81.30046935430791]
We propose a novel Task Adaptation Network (TAN) to solve this unsupervised task transfer problem.
In addition to learning transferable features via domain-adversarial training, we propose a novel task semantic adaptor that uses the learning-to-learn strategy to adapt the task semantics.
TAN significantly increases recall and F1 score by 5.0% and 7.8%, respectively, compared to recent strong baselines.
arXiv Detail & Related papers (2021-03-03T11:18:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.