SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
- URL: http://arxiv.org/abs/2211.16198v4
- Date: Tue, 15 Aug 2023 13:31:15 GMT
- Title: SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
- Authors: Vishaal Udandarao, Ankush Gupta, Samuel Albanie
- Abstract summary: Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet effective way to train large-scale vision-language models.
Fine-tuning the entire CLIP model can be resource-intensive and unstable.
We propose a novel method, SuS-X, that requires neither intensive fine-tuning nor costly labelled data.
- Score: 28.06403983530132
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet
effective way to train large-scale vision-language models. CLIP demonstrates
impressive zero-shot classification and retrieval on diverse downstream tasks.
However, to leverage its full potential, fine-tuning still appears to be
necessary. Fine-tuning the entire CLIP model can be resource-intensive and
unstable. Moreover, recent methods that aim to circumvent this need for
fine-tuning still require access to images from the target distribution. In
this paper, we pursue a different approach and explore the regime of
training-free "name-only transfer" in which the only knowledge we possess about
the downstream task comprises the names of downstream target categories. We
propose a novel method, SuS-X, consisting of two key building blocks -- SuS and
TIP-X -- that requires neither intensive fine-tuning nor costly labelled data.
SuS-X achieves state-of-the-art zero-shot classification results on 19
benchmark datasets. We further show the utility of TIP-X in the training-free
few-shot setting, where we again achieve state-of-the-art results over strong
training-free baselines. Code is available at
https://github.com/vishaal27/SuS-X.
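To make the training-free "name-only transfer" setting concrete, the sketch below shows plain zero-shot CLIP classification built solely from category names. This is a minimal illustration of the baseline regime the paper operates in, not the SuS-X method itself (SuS support-set construction and TIP-X inference are omitted); it assumes the OpenAI `clip` package and uses an illustrative ViT-B/32 backbone, placeholder class names, and a placeholder image path.

```python
# Minimal sketch of name-only zero-shot classification with CLIP.
# The only task knowledge is the list of category names; no training,
# no labelled data. Class names and "query.jpg" are placeholders.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["golden retriever", "tabby cat", "school bus"]
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    text_features = model.encode_text(prompts)
    image_features = model.encode_image(image)
    # Cosine similarity between the image and each class-name prompt
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```

SuS-X builds on this baseline by curating a support set for the category names (SuS) and using it at inference time via TIP-X, still without any fine-tuning or labelled target-domain data.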
Related papers
- GraphCLIP: Enhancing Transferability in Graph Foundation Models for Text-Attributed Graphs [27.169892145194638]
GraphCLIP is a framework to learn graph foundation models with strong cross-domain zero/few-shot transferability.
We generate and curate large-scale graph-summary pair data with the assistance of LLMs.
For few-shot learning, we propose a novel graph prompt tuning technique aligned with our pretraining objective.
arXiv Detail & Related papers (2024-10-14T09:40:52Z)
- Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model [43.738677778740325]
We propose a novel framework, termed Candle, to achieve efficient and long-tailed generalization.
Candle achieves state-of-the-art performance over extensive experiments on 11 diverse datasets.
arXiv Detail & Related papers (2024-06-18T14:07:13Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting [111.49781716597984]
We propose a multimodal prompt learning scheme that balances supervised and zero-shot performance under a single unified training procedure.
We can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting.
arXiv Detail & Related papers (2023-04-06T18:00:04Z)
- Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models [13.340759455910721]
We propose a novel method to prevent zero-shot transfer degradation in the continual learning of vision-language models.
Our method outperforms other methods in the traditional class-incremental learning setting.
arXiv Detail & Related papers (2023-03-12T10:28:07Z)
- SGL-PT: A Strong Graph Learner with Graph Prompt Tuning [36.650472660276]
We propose a novel framework named SGL-PT, which follows the learning strategy "Pre-train, Prompt, and Predict".
Specifically, we introduce a strong and universal pre-training task, coined SGL, that acquires the complementary merits of generative and contrastive self-supervised graph learning.
Aiming at the graph classification task, we unify pre-training and fine-tuning by designing a novel verbalizer-free prompting function, which reformulates the downstream task in a format similar to the pretext task.
arXiv Detail & Related papers (2023-02-24T04:31:18Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
- DATA: Domain-Aware and Task-Aware Pre-training [94.62676913928831]
We present DATA, a simple yet effective NAS approach specialized for self-supervised learning (SSL).
Our method achieves promising results across a wide range of computation costs on downstream tasks, including image classification, object detection and semantic segmentation.
arXiv Detail & Related papers (2022-03-17T02:38:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.