Related papers: What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights

URL: http://arxiv.org/abs/2405.21070v3
Date: Sun, 27 Oct 2024 23:53:20 GMT
Title: What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights
Authors: Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi
Abstract summary: Severe data imbalance naturally exists among web-scale vision-language datasets. We find CLIP pre-trained thereupon exhibits notable robustness to the data imbalance compared to supervised learning. The robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts.
Score: 67.72413262980272
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Severe data imbalance naturally exists among web-scale vision-language datasets. Despite this, we find CLIP pre-trained thereupon exhibits notable robustness to the data imbalance compared to supervised learning, and demonstrates significant effectiveness in learning generalizable representations. With an aim to investigate the reasons behind this finding, we conduct controlled experiments to study various underlying factors, and reveal that CLIP's pretext task forms a dynamic classification problem wherein only a subset of classes is present in training. This isolates the bias from dominant classes and implicitly balances the learning signal. Furthermore, the robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts, which are inaccessible to supervised learning. Our study not only uncovers the mechanisms behind CLIP's generalizability beyond data imbalance but also provides transferable insights for the research community. The findings are validated in both supervised and self-supervised learning, enabling models trained on imbalanced data to achieve CLIP-level performance on diverse recognition tasks. Code and data are available at: https://github.com/CVMI-Lab/clip-beyond-tail.
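
The "dynamic classification" idea in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it assumes class labels stand in for captions, uses only the image-to-text direction, omits the temperature, and every function name (`full_vocabulary_loss`, `in_batch_loss`) and embedding below is hypothetical. It contrasts a supervised-style softmax over the whole vocabulary with a CLIP-style softmax over only the classes present in the current batch.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax_cross_entropy(logits, target_index):
    """Cross-entropy for one sample, computed with a stable log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]

def full_vocabulary_loss(image_emb, label, class_embs):
    """Supervised-style loss: every class in the vocabulary competes."""
    logits = [dot(image_emb, c) for c in class_embs]
    return softmax_cross_entropy(logits, label)

def in_batch_loss(image_emb, label, class_embs, batch_labels):
    """Simplified CLIP-style loss: only classes present in the batch compete.

    Assumption: `label` occurs in `batch_labels`, as it would for a sample
    drawn from that batch.
    """
    candidates = sorted(set(batch_labels))
    logits = [dot(image_emb, class_embs[c]) for c in candidates]
    return softmax_cross_entropy(logits, candidates.index(label))

# Hypothetical 2-D text embeddings for four classes; class 0 plays the "head" class.
class_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
image_emb = [0.2, -0.9]      # an image of tail class 3
batch_labels = [3, 2, 3]     # the head class is absent from this batch

print(full_vocabulary_loss(image_emb, 3, class_embs))
print(in_batch_loss(image_emb, 3, class_embs, batch_labels))
```

In this toy setting the in-batch loss never exceeds the full-vocabulary loss, because its normalizer sums over a subset of the classes. A head class that is absent from the batch contributes no competing logit for that step, which is one way to read the abstract's claim that the bias from dominant classes is isolated.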

Related papers

Mind the Gap: Preserving and Compensating for the Modality Gap in CLIP-Based Continual Learning [11.50324946279326]
The Contrastive Language-Image Pre-trained model (CLIP) exhibits strong capabilities across various downstream tasks. We analyze the variations in the modality gap during the fine-tuning of vision-language pre-trained models. We propose a simple yet effective method, MG-CLIP, that improves CLIP's performance in class-incremental learning.
arXiv Detail & Related papers (2025-07-12T02:28:42Z)
What Matters for In-Context Learning: A Balancing Act of Look-up and In-Weight Learning [42.8453045943264]
We show that conceptual repetitions in the data sequences are crucial for ICL. We also show that the emergence of ICL depends on balancing the in-weight learning objective with the in-context solving ability.
arXiv Detail & Related papers (2025-01-09T09:45:05Z)
Adaptive Rank, Reduced Forgetting: Knowledge Retention in Continual Learning Vision-Language Models with Dynamic Rank-Selective LoRA [19.982853959240497]
Pre-trained vision-language embedding models such as CLIP have been widely adopted and validated in Continual Learning (CL). Existing CL methods primarily focus on continual downstream adaptation using components isolated from the pre-trained model (PTM). We propose a universal and efficient CL approach for CLIP based on Dynamic Rank-Selective LoRA (CoDyRA).
arXiv Detail & Related papers (2024-12-01T23:41:42Z)
Does the Definition of Difficulty Matter? Scoring Functions and their Role for Curriculum Learning [42.4526628515253]
Curriculum learning (CL) describes a machine learning training strategy in which samples are gradually introduced into the training process based on their difficulty. We study the robustness and similarity of the most common scoring functions for sample difficulty estimation. We find that the robustness of scoring functions across random seeds positively correlates with CL performance.
arXiv Detail & Related papers (2024-11-01T18:55:31Z)
A Survey of the Self Supervised Learning Mechanisms for Vision Transformers [5.152455218955949]
The application of self-supervised learning (SSL) in vision tasks has gained significant attention. We develop a comprehensive taxonomy for systematically classifying the SSL techniques. We discuss the motivations behind SSL, review popular pre-training tasks, and highlight the challenges and advancements in this field.
arXiv Detail & Related papers (2024-08-30T07:38:28Z)
Harmony: A Joint Self-Supervised and Weakly-Supervised Framework for Learning General Purpose Visual Representations [6.990891188823598]
We present Harmony, a framework that combines vision-language training with discriminative and generative self-supervision. Our framework is specifically designed to work on web-scraped data by not relying on negative examples in the self-supervised learning path. We evaluate Harmony across various vision downstream tasks and find that it significantly outperforms the baseline CLIP.
arXiv Detail & Related papers (2024-05-23T07:18:08Z)
ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP). ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective. We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
On the Effectiveness of Equivariant Regularization for Robust Online Continual Learning [17.995662644298974]
Continual Learning (CL) approaches seek to bridge this gap by facilitating the transfer of knowledge to both previous tasks and future ones. Recent research has shown that self-supervision can produce versatile models that can generalize well to diverse downstream tasks. We propose Continual Learning via Equivariant Regularization (CLER), an OCL approach that leverages equivariant tasks for self-supervision.
arXiv Detail & Related papers (2023-05-05T16:10:31Z)
Stabilizing and Improving Federated Learning with Non-IID Data and Client Dropout [15.569507252445144]
Data heterogeneity induced by label distribution skew has been shown to be a significant obstacle that limits model performance in federated learning. We propose a simple yet effective framework by introducing a prior-calibrated softmax function for computing the cross-entropy loss. The improved model performance over existing baselines in the presence of non-IID data and client dropout is demonstrated.
arXiv Detail & Related papers (2023-03-11T05:17:59Z)
Learning Deep Representations via Contrastive Learning for Instance Retrieval [11.736450745549792]
This paper makes the first attempt to tackle the problem using instance-discrimination-based contrastive learning (CL). In this work, we approach this problem by exploring the capability of deriving discriminative representations from pre-trained and fine-tuned CL models.
arXiv Detail & Related papers (2022-09-28T04:36:34Z)
Beyond Supervised Continual Learning: a Review [69.9674326582747]
Continual Learning (CL) is a flavor of machine learning where the usual assumption of stationary data distribution is relaxed or omitted. Changes in the data distribution can cause the so-called catastrophic forgetting (CF) effect: an abrupt loss of previous knowledge. This article reviews literature that studies CL in other settings, such as learning with reduced supervision, fully unsupervised learning, and reinforcement learning.
arXiv Detail & Related papers (2022-08-30T14:44:41Z)
Using Representation Expressiveness and Learnability to Evaluate Self-Supervised Learning Methods [61.49061000562676]
We introduce Cluster Learnability (CL) to assess learnability. CL is measured in terms of the performance of a KNN trained to predict labels obtained by clustering the representations with K-means. We find that CL better correlates with in-distribution model performance than other competing recent evaluation schemes.
arXiv Detail & Related papers (2022-06-02T19:05:13Z)
SLIP: Self-supervision meets Language-Image Pre-training [79.53764315471543]
We study whether self-supervised learning can aid in the use of language supervision for visual representation learning. We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training. We find that SLIP enjoys the best of both worlds: better performance than self-supervision and language supervision.
arXiv Detail & Related papers (2021-12-23T18:07:13Z)
Self-supervised Learning is More Robust to Dataset Imbalance [65.84339596595383]
We investigate self-supervised learning under dataset imbalance. Off-the-shelf self-supervised representations are already more robust to class imbalance than supervised representations. We devise a re-weighted regularization technique that consistently improves the SSL representation quality on imbalanced datasets.
arXiv Detail & Related papers (2021-10-11T06:29:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.