HyperCLIP: Adapting Vision-Language models with Hypernetworks
- URL: http://arxiv.org/abs/2412.16777v1
- Date: Sat, 21 Dec 2024 21:19:08 GMT
- Title: HyperCLIP: Adapting Vision-Language models with Hypernetworks
- Authors: Victor Akinwande, Mohammad Sadegh Norouzzadeh, Devin Willmott, Anna Bair, Madan Ravi Ganesh, J. Zico Kolter
- Abstract summary: We propose an alternate vision-language architecture, called HyperCLIP, that uses a small image encoder along with a hypernetwork.
All three components of the model (hypernetwork, image encoder, and text encoder) are pre-trained jointly end-to-end.
HyperCLIP increases the zero-shot accuracy of SigLIP-trained models with small image encoders by up to 3% on ImageNet and 5% on CIFAR-100 with minimal training throughput overhead.
- Abstract: Self-supervised vision-language models trained with contrastive objectives form the basis of current state-of-the-art methods in AI vision tasks. The success of these models is a direct consequence of the huge web-scale datasets used to train them, but they require correspondingly large vision components to properly learn powerful and general representations from such a broad data domain. This poses a challenge for deploying large vision-language models, especially in resource-constrained environments. To address this, we propose an alternate vision-language architecture, called HyperCLIP, that uses a small image encoder along with a hypernetwork that dynamically adapts image encoder weights to each new set of text inputs. All three components of the model (hypernetwork, image encoder, and text encoder) are pre-trained jointly end-to-end, and with a trained HyperCLIP model, we can generate new zero-shot deployment-friendly image classifiers for any task with a single forward pass through the text encoder and hypernetwork. HyperCLIP increases the zero-shot accuracy of SigLIP trained models with small image encoders by up to 3% on ImageNet and 5% on CIFAR-100 with minimal training throughput overhead.
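The abstract describes the core mechanism: a single forward pass through the text encoder and hypernetwork produces task-specific image-encoder weights, which then serve as a deployment-friendly zero-shot classifier. Below is a minimal, hypothetical PyTorch sketch of that idea. The module names, the choice to adapt only a final projection layer, the pooling over class prompts, and the SigLIP-style sigmoid loss are all illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the HyperCLIP idea: text encoder -> hypernetwork ->
# adapted image-encoder weights -> image embeddings, trained with a
# SigLIP-style sigmoid contrastive objective. All names and sizes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperCLIPSketch(nn.Module):
    def __init__(self, embed_dim=256, img_feat_dim=512, txt_feat_dim=512):
        super().__init__()
        # Small image backbone (stand-in for the paper's small image encoder).
        self.image_backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, img_feat_dim), nn.ReLU(),
        )
        # Text encoder stand-in (in practice, a transformer over tokenized prompts).
        self.text_encoder = nn.Sequential(
            nn.Linear(txt_feat_dim, txt_feat_dim), nn.ReLU(),
            nn.Linear(txt_feat_dim, embed_dim),
        )
        # Hypernetwork: maps pooled text context to the weights of the image
        # encoder's final projection (an assumption about which weights adapt).
        self.hypernet = nn.Linear(embed_dim, img_feat_dim * embed_dim + embed_dim)
        self.embed_dim, self.img_feat_dim = embed_dim, img_feat_dim
        self.logit_scale = nn.Parameter(torch.tensor(10.0))
        self.logit_bias = nn.Parameter(torch.tensor(-10.0))

    def adapted_projection(self, text_emb):
        # One pass through text encoder + hypernetwork yields a task-specific
        # projection: the "zero-shot classifier generation" step.
        ctx = text_emb.mean(dim=0)  # pool over the set of class prompts
        params = self.hypernet(ctx)
        W = params[: self.img_feat_dim * self.embed_dim].view(self.embed_dim, self.img_feat_dim)
        b = params[self.img_feat_dim * self.embed_dim:]
        return W, b

    def forward(self, images, text_feats):
        text_emb = F.normalize(self.text_encoder(text_feats), dim=-1)
        W, b = self.adapted_projection(text_emb)
        img_feat = self.image_backbone(images)
        img_emb = F.normalize(F.linear(img_feat, W, b), dim=-1)
        # SigLIP-style pairwise logits with learnable scale and bias.
        return img_emb @ text_emb.t() * self.logit_scale + self.logit_bias

def siglip_style_loss(logits):
    # Sigmoid contrastive loss on a batch of matched image-text pairs:
    # +1 targets on the diagonal, -1 off the diagonal.
    targets = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    return F.softplus(-targets * logits).mean()
```

In this sketch, swapping in a new set of class prompts and re-running adapted_projection is all that is needed to produce a new classifier head, matching the single-forward-pass deployment story described in the abstract.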
Related papers
- LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task [0.0]
This research explores the development of vision-language models for image retrieval in low-resource languages, specifically Azerbaijani.
We integrated the CLIP model architecture and employed several techniques to balance computational efficiency with performance.
Our study found that models like EfficientNet0 and Tiny Swin Transformer perform best on the datasets they were trained on.
arXiv Detail & Related papers (2024-08-25T18:10:16Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- Controlling Vision-Language Models for Multi-Task Image Restoration [6.239038964461397]
We present a degradation-aware vision-language model (DA-CLIP) to better transfer pretrained vision-language models to low-level vision tasks.
Our approach advances state-of-the-art performance on both degradation-specific and unified image restoration tasks.
arXiv Detail & Related papers (2023-10-02T09:10:16Z)
- GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is used to help models like CNNs and ViTs learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
arXiv Detail & Related papers (2023-06-01T14:02:45Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- GIT: A Generative Image-to-text Transformer for Vision and Language [138.91581326369837]
We train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr).
arXiv Detail & Related papers (2022-05-27T17:03:38Z)
- Unsupervised Prompt Learning for Vision-Language Models [12.259694415428026]
We propose an unsupervised prompt learning (UPL) framework to improve the zero-shot transfer of CLIP-like vision-language models.
An enhanced version of UPL is even on par with the 8-shot CoOp and the 8-shot TIP-Adapter on most datasets.
arXiv Detail & Related papers (2022-04-07T17:59:57Z)