SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
- URL: http://arxiv.org/abs/2211.16198v4
- Date: Tue, 15 Aug 2023 13:31:15 GMT
- Title: SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
- Authors: Vishaal Udandarao, Ankush Gupta, Samuel Albanie
- Abstract summary: Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet effective way to train large-scale vision-language models.
Fine-tuning the entire CLIP model can be resource-intensive and unstable.
We propose a novel method, SuS-X, that requires neither intensive fine-tuning nor costly labelled data.
- Score: 28.06403983530132
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet
effective way to train large-scale vision-language models. CLIP demonstrates
impressive zero-shot classification and retrieval on diverse downstream tasks.
However, to leverage its full potential, fine-tuning still appears to be
necessary. Fine-tuning the entire CLIP model can be resource-intensive and
unstable. Moreover, recent methods that aim to circumvent this need for
fine-tuning still require access to images from the target distribution. In
this paper, we pursue a different approach and explore the regime of
training-free "name-only transfer" in which the only knowledge we possess about
the downstream task comprises the names of downstream target categories. We
propose a novel method, SuS-X, consisting of two key building blocks -- SuS and
TIP-X, that requires neither intensive fine-tuning nor costly labelled data.
SuS-X achieves state-of-the-art zero-shot classification results on 19
benchmark datasets. We further show the utility of TIP-X in the training-free
few-shot setting, where we again achieve state-of-the-art results over strong
training-free baselines. Code is available at
https://github.com/vishaal27/SuS-X.
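For context on the "name-only transfer" regime described above, the sketch below shows the plain CLIP zero-shot baseline that SuS-X builds on: the only task knowledge is the list of category names, and a query image is classified by cosine similarity between its embedding and the embeddings of class-name prompts. This is a minimal illustration, not the SuS-X method itself (SuS-X additionally curates a support set, SuS, and applies TIP-X on top of this baseline); the `clip` package, the prompt template, the example class names, and the file `query.jpg` are illustrative assumptions.

```python
# Minimal sketch of training-free, name-only zero-shot classification with CLIP.
# Not the SuS-X implementation; assumes the OpenAI `clip` package and PyTorch.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# The only knowledge about the downstream task: its category names (hypothetical examples).
class_names = ["golden retriever", "tabby cat", "sports car"]
prompts = [f"a photo of a {name}" for name in class_names]

with torch.no_grad():
    # Encode the class-name prompts once; no labelled images and no fine-tuning.
    text_features = model.encode_text(clip.tokenize(prompts).to(device))
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Classify a query image by cosine similarity to the prompt embeddings.
    image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)

    logits = 100.0 * image_features @ text_features.T
    prediction = class_names[logits.argmax(dim=-1).item()]

print(prediction)
```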
Related papers
- Can Graph Neural Networks Learn Language with Extremely Weak Text Supervision? [62.12375949429938]
Building transferable Graph Neural Networks (GNNs) with the CLIP pipeline is challenging because of three fundamental issues.
We leverage multi-modal prompt learning to effectively adapt a pre-trained GNN to downstream tasks and data.
Our new paradigm embeds the graphs directly in the same space as the Large Language Models (LLMs) by learning both graph prompts and text prompts simultaneously.
arXiv Detail & Related papers (2024-12-11T08:03:35Z)
- GraphCLIP: Enhancing Transferability in Graph Foundation Models for Text-Attributed Graphs [27.169892145194638]
GraphCLIP is a framework to learn graph foundation models with strong cross-domain zero/few-shot transferability.
We generate and curate large-scale graph-summary pair data with the assistance of LLMs.
For few-shot learning, we propose a novel graph prompt tuning technique aligned with our pretraining objective.
arXiv Detail & Related papers (2024-10-14T09:40:52Z)
- Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model [43.738677778740325]
We propose a novel framework, termed Candle, to achieve efficient and long-tailed generalization.
Candle achieves state-of-the-art performance over extensive experiments on 11 diverse datasets.
arXiv Detail & Related papers (2024-06-18T14:07:13Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data [122.282521548393]
Contrastive Language-Image Pre-training (CLIP) has become the standard for cross-modal image-text representation learning.
We introduce HELIP, a cost-effective strategy that improves CLIP models by exploiting challenging text-image pairs within existing datasets in continuous training.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models [13.340759455910721]
We propose a novel method to prevent zero-shot transfer degradation in the continual learning of vision-language models.
Our method outperforms other methods in the traditional class-incremental learning setting.
arXiv Detail & Related papers (2023-03-12T10:28:07Z)
- SGL-PT: A Strong Graph Learner with Graph Prompt Tuning [36.650472660276]
We propose a novel framework named SGL-PT, which follows the learning strategy of "Pre-train, Prompt, and Predict".
Specifically, we introduce a strong and universal pre-training task, coined SGL, that acquires the complementary merits of generative and contrastive self-supervised graph learning.
Aiming at the graph classification task, we unify pre-training and fine-tuning by designing a novel verbalizer-free prompting function, which reformulates the downstream task in a format similar to the pretext task.
arXiv Detail & Related papers (2023-02-24T04:31:18Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
- DATA: Domain-Aware and Task-Aware Pre-training [94.62676913928831]
We present DATA, a simple yet effective NAS approach specialized for self-supervised learning (SSL).
Our method achieves promising results across a wide range of computation costs on downstream tasks, including image classification, object detection and semantic segmentation.
arXiv Detail & Related papers (2022-03-17T02:38:49Z)