Patching open-vocabulary models by interpolating weights
- URL: http://arxiv.org/abs/2208.05592v1
- Date: Wed, 10 Aug 2022 23:47:43 GMT
- Title: Patching open-vocabulary models by interpolating weights
- Authors: Gabriel Ilharco, Mitchell Wortsman, Samir Yitzhak Gadre, Shuran Song,
Hannaneh Hajishirzi, Simon Kornblith, Ali Farhadi, Ludwig Schmidt
- Abstract summary: Open-vocabulary models like CLIP achieve high accuracy across many image classification tasks.
We study model patching, where the goal is to improve accuracy on specific tasks without degrading accuracy on tasks where performance is already adequate.
Our findings demonstrate that it is possible to expand the set of tasks on which open-vocabulary models achieve high accuracy without re-training them from scratch.
- Score: 85.12977566514984
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-vocabulary models like CLIP achieve high accuracy across many image
classification tasks. However, there are still settings where their zero-shot
performance is far from optimal. We study model patching, where the goal is to
improve accuracy on specific tasks without degrading accuracy on tasks where
performance is already adequate. Towards this goal, we introduce PAINT, a
patching method that uses interpolations between the weights of a model before
fine-tuning and the weights after fine-tuning on a task to be patched. On nine
tasks where zero-shot CLIP performs poorly, PAINT increases accuracy by 15 to
60 percentage points while preserving accuracy on ImageNet within one
percentage point of the zero-shot model. PAINT also allows a single model to be
patched on multiple tasks and improves with model scale. Furthermore, we
identify cases of broad transfer, where patching on one task increases accuracy
on other tasks even when the tasks have disjoint classes. Finally, we
investigate applications beyond common benchmarks such as counting or reducing
the impact of typographic attacks on CLIP. Our findings demonstrate that it is
possible to expand the set of tasks on which open-vocabulary models achieve
high accuracy without re-training them from scratch.
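As a concrete illustration of the core operation, the interpolation that PAINT performs can be written in a few lines of PyTorch. The sketch below is a minimal reading of the abstract: it linearly mixes the zero-shot and fine-tuned state dicts with a coefficient alpha (how alpha is chosen, e.g. on held-out data, is simplified away here).

```python
import torch

def paint_interpolate(zeroshot_sd, finetuned_sd, alpha=0.5):
    """Linearly interpolate two state dicts of the same architecture:
    theta_patched = (1 - alpha) * theta_zeroshot + alpha * theta_finetuned.
    alpha = 0 recovers the zero-shot model, alpha = 1 the fine-tuned one."""
    assert zeroshot_sd.keys() == finetuned_sd.keys()
    return {
        name: (1 - alpha) * zeroshot_sd[name] + alpha * finetuned_sd[name]
        for name in zeroshot_sd
    }

# Hypothetical usage: `model` was fine-tuned on the task to be patched,
# starting from the zero-shot weights saved in `zeroshot_sd`.
# patched_sd = paint_interpolate(zeroshot_sd, model.state_dict(), alpha=0.5)
# model.load_state_dict(patched_sd)
```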
Related papers
- Patch Ranking: Efficient CLIP by Learning to Rank Local Patches [11.225834286969283]
Current strategies to boost ViT efficiency focus on pruning patch tokens but fall short in addressing the multimodal nature of CLIP.
We propose greedy search methods to establish a "Golden Ranking" and introduce a lightweight predictor specifically trained to approximate this Ranking.
We reduce the number of patch tokens in CLIP's ViT by 40% while incurring only a minimal average accuracy loss of 0.3 across seven datasets.
arXiv Detail & Related papers (2024-09-22T22:04:26Z)
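For intuition, token pruning with a learned ranking might look like the sketch below. The scoring model is a stand-in for the paper's lightweight predictor, and the function simply keeps the top-scoring fraction of patch tokens per image (assumed shapes, not the paper's code).

```python
import torch

def prune_patch_tokens(tokens, scores, keep_ratio=0.6):
    """Keep the highest-scoring fraction of ViT patch tokens.

    tokens: (batch, num_patches, dim) patch embeddings (CLS token excluded)
    scores: (batch, num_patches) predicted importance of each token
    """
    num_keep = max(1, int(tokens.shape[1] * keep_ratio))
    top_idx = scores.topk(num_keep, dim=1).indices          # (batch, num_keep)
    gather_idx = top_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return tokens.gather(1, gather_idx)

# Example: drop 40% of the 196 patch tokens of a ViT-B/16 at 224x224.
tokens = torch.randn(2, 196, 768)
scores = torch.randn(2, 196)  # stand-in for the trained predictor
print(prune_patch_tokens(tokens, scores).shape)  # torch.Size([2, 117, 768])
```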
- Transductive Zero-Shot and Few-Shot CLIP [24.592841797020203]
This paper addresses the transductive zero-shot and few-shot CLIP classification challenge.
Inference is performed jointly across a mini-batch of unlabeled query samples, rather than treating each instance independently.
Our approach yields nearly a 20% improvement in ImageNet accuracy over CLIP's zero-shot performance.
arXiv Detail & Related papers (2024-04-08T12:44:31Z)
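The transductive idea can be illustrated with a toy re-weighting scheme (an assumption for illustration, not the paper's actual algorithm): alternate between estimating a class prior from the whole unlabeled batch and re-scoring every sample with that prior, so predictions are coupled across the batch.

```python
import torch

def transductive_predict(logits, num_iters=10):
    """Couple zero-shot predictions across an unlabeled batch.

    logits: (batch, num_classes) CLIP image-text similarities.
    Illustrative EM-style scheme: estimate a batch-level class prior,
    then re-weight each sample's softmax scores by that prior.
    """
    base = logits.softmax(dim=-1)
    probs = base
    for _ in range(num_iters):
        prior = probs.mean(dim=0, keepdim=True)   # (1, num_classes)
        probs = base * prior
        probs = probs / probs.sum(dim=-1, keepdim=True)
    return probs.argmax(dim=-1)

# Toy example: a 32-image batch over 10 candidate classes.
print(transductive_predict(torch.randn(32, 10)))
```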
- Enhancing Few-shot CLIP with Semantic-Aware Fine-Tuning [61.902254546858465]
Methods based on Contrastive Language-Image Pre-training have exhibited promising performance in few-shot adaptation tasks.
We propose fine-tuning the parameters of the attention pooling layer to encourage the model to focus on task-specific semantics.
arXiv Detail & Related papers (2023-11-08T05:18:57Z)
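Restricting fine-tuning to a single module is straightforward in PyTorch; the sketch below freezes everything except parameters whose names match a substring. The name "attnpool" matches OpenAI's ResNet-based CLIP image encoder, but is an assumption here, and the right identifier depends on the implementation.

```python
import torch

def freeze_all_but(model, trainable_substring="attnpool"):
    """Freeze every parameter except those whose name contains the
    given substring, and return the trainable ones for the optimizer."""
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = trainable_substring in name
        if param.requires_grad:
            trainable.append(param)
    return trainable

# Hypothetical usage with a CLIP visual tower:
# params = freeze_all_but(clip_model.visual, "attnpool")
# optimizer = torch.optim.AdamW(params, lr=1e-5)
```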
- Task-Specific Skill Localization in Fine-tuned Language Models [36.53572616441048]
This paper introduces the term skill localization for the problem of identifying the small set of parameters responsible for a fine-tuned model's task performance.
A simple optimization is used to identify a very small subset of parameters.
Grafting the fine-tuned values for just this tiny subset onto the pre-trained model yields performance almost as good as the fully fine-tuned model's.
arXiv Detail & Related papers (2023-02-13T18:55:52Z)
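The grafting step itself is easy to picture: copy the fine-tuned values for a chosen, very small parameter subset into the pre-trained weights and leave everything else untouched. In the sketch below the masks marking that subset are given; constructing them is where the paper's optimization comes in (the threshold-based mask in the comment is only a stand-in).

```python
import torch

def graft(pretrained_sd, finetuned_sd, masks):
    """Overwrite a sparse subset of pre-trained weights with their
    fine-tuned values. masks[name] is a boolean tensor marking the
    grafted entries of each parameter tensor."""
    grafted = {}
    for name, pre in pretrained_sd.items():
        mask = masks.get(name)
        if mask is None:
            grafted[name] = pre.clone()
        else:
            grafted[name] = torch.where(mask, finetuned_sd[name], pre)
    return grafted

# Stand-in mask: graft only the entries that moved the most during
# fine-tuning (the paper instead optimizes the subset directly).
# masks = {name: (finetuned_sd[name] - pretrained_sd[name]).abs() > tau
#          for name in pretrained_sd}
```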
- Improving Zero-shot Generalization and Robustness of Multi-modal Models [70.14692320804178]
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks.
We investigate their failure cases and find that many are caused by ambiguity in the text prompts.
We propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy.
arXiv Detail & Related papers (2022-12-04T07:26:24Z)
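As a rough illustration of how the WordNet hierarchy can disambiguate a prompt (an assumption about the general approach, not the paper's exact prompt construction), one can qualify a class name with the hypernym of each of its senses using NLTK:

```python
# Requires: pip install nltk, then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def disambiguated_prompts(class_name):
    """Build one prompt per noun sense of a class name, each qualified
    by its parent concept (hypernym) in WordNet."""
    prompts = []
    for synset in wn.synsets(class_name, pos=wn.NOUN):
        hypernyms = synset.hypernyms()
        if hypernyms:
            parent = hypernyms[0].lemma_names()[0].replace("_", " ")
            prompts.append(f"a photo of a {class_name}, a kind of {parent}")
    return prompts or [f"a photo of a {class_name}"]

# "crane" is ambiguous between the machine and the bird; the hypernym
# in each prompt resolves which sense is meant.
print(disambiguated_prompts("crane"))
```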
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time [69.7693300927423]
We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations improves accuracy and robustness.
We show that the model soup approach extends to multiple image classification and natural language processing tasks.
arXiv Detail & Related papers (2022-03-10T17:03:49Z)
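The uniform variant of a model soup is just an element-wise mean over the fine-tuned weights; a minimal sketch follows (the paper's greedy soup additionally selects which runs to include based on held-out accuracy):

```python
import torch

def uniform_soup(state_dicts):
    """Average the weights of several models fine-tuned from the same
    initialization (all state dicts must share the same keys/shapes)."""
    return {
        name: torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
        for name in state_dicts[0]
    }

# Hypothetical usage with three fine-tuning runs of one architecture:
# soup_sd = uniform_soup([run.state_dict() for run in (run1, run2, run3)])
# model.load_state_dict(soup_sd)
```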
- Rethinking Keypoint Representations: Modeling Keypoints and Poses as Objects for Multi-Person Human Pose Estimation [79.78017059539526]
We propose a new heatmap-free keypoint estimation method, KAPAO, in which individual keypoints and sets of spatially related keypoints (i.e., poses) are modeled as objects within a dense single-stage anchor-based detection framework.
In experiments, we observe that KAPAO is significantly faster and more accurate than previous methods, which suffer greatly from heatmap post-processing.
Our large model, KAPAO-L, achieves an AP of 70.6 on the Microsoft COCO Keypoints validation set without test-time augmentation.
arXiv Detail & Related papers (2021-11-16T15:36:44Z)
- Rectification-based Knowledge Retention for Continual Learning [49.1447478254131]
Deep learning models suffer from catastrophic forgetting when trained in an incremental learning setting.
We propose a novel approach for task-incremental learning, in which a model is trained on new tasks that arrive sequentially.
Our approach can be used in both the zero-shot and non-zero-shot task-incremental learning settings.
arXiv Detail & Related papers (2021-03-30T18:11:30Z)