Model soups: averaging weights of multiple fine-tuned models improves
accuracy without increasing inference time
- URL: http://arxiv.org/abs/2203.05482v1
- Date: Thu, 10 Mar 2022 17:03:49 GMT
- Title: Model soups: averaging weights of multiple fine-tuned models improves
accuracy without increasing inference time
- Authors: Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca
Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali
Farhadi, Yair Carmon, Simon Kornblith, Ludwig Schmidt
- Abstract summary: We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations improves accuracy and robustness.
We show that the model soup approach extends to multiple image classification and natural language processing tasks.
- Score: 69.7693300927423
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The conventional recipe for maximizing model accuracy is to (1) train
multiple models with various hyperparameters and (2) pick the individual model
which performs best on a held-out validation set, discarding the remainder. In
this paper, we revisit the second step of this procedure in the context of
fine-tuning large pre-trained models, where fine-tuned models often appear to
lie in a single low error basin. We show that averaging the weights of multiple
models fine-tuned with different hyperparameter configurations often improves
accuracy and robustness. Unlike a conventional ensemble, we may average many
models without incurring any additional inference or memory costs -- we call
the results "model soups." When fine-tuning large pre-trained models such as
CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides
significant improvements over the best model in a hyperparameter sweep on
ImageNet. As a highlight, the resulting ViT-G model attains 90.94% top-1
accuracy on ImageNet, a new state of the art. Furthermore, we show that the
model soup approach extends to multiple image classification and natural
language processing tasks, improves out-of-distribution performance, and
improves zero-shot performance on new downstream tasks. Finally, we
analytically relate the performance similarity of weight-averaging and
logit-ensembling to flatness of the loss and confidence of the predictions, and
validate this relation empirically.
Related papers
- Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation [36.45488536471859]
Similarity refinement uses unlabeled images to refine the image-image similarity.
Weight calibration introduces a precision matrix into the weight function to adequately model the relation between training samples.
To reduce the high complexity of GPs, a group-based learning strategy is proposed.
arXiv Detail & Related papers (2024-10-11T15:12:30Z) - Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think [53.2706196341054]
We show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed.
We perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models.
arXiv Detail & Related papers (2024-09-17T16:58:52Z) - EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) achieves outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free, requiring no data or additional training, while showing impressive performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z) - Model Stock: All we need is just a few fine-tuned models [34.449901046895185]
This paper introduces an efficient fine-tuning method for large pre-trained models, offering strong in-distribution (ID) and out-of-distribution (OOD) performance.
We employ significantly fewer models to obtain the final weights, yet yield superior accuracy.
We demonstrate the efficacy of Model Stock with fine-tuned models based upon pre-trained CLIP architectures.
arXiv Detail & Related papers (2024-03-28T15:57:20Z) - Do the Frankenstein, or how to achieve better out-of-distribution
performance with manifold mixing model soup [1.0878040851637998]
We show that the fused model gives significantly better out-of-distribution performance when fine-tuning a CLIP model for image classification.
It also provides better accuracy on the original dataset on which the fine-tuning was done.
arXiv Detail & Related papers (2023-08-28T06:13:32Z) - Precision-Recall Divergence Optimization for Generative Modeling with
GANs and Normalizing Flows [54.050498411883495]
We develop a novel training method for generative models, such as Generative Adversarial Networks and Normalizing Flows.
We show that achieving a specified precision-recall trade-off corresponds to minimizing a unique $f$-divergence from a family we call the PR-divergences.
Our approach improves the performance of existing state-of-the-art models like BigGAN in terms of either precision or recall when tested on datasets such as ImageNet.
arXiv Detail & Related papers (2023-05-30T10:07:17Z) - CAMERO: Consistency Regularized Ensemble of Perturbed Language Models
with Weight Sharing [83.63107444454938]
We propose a consistency-regularized ensemble learning approach based on perturbed models, named CAMERO.
Specifically, we share the weights of the bottom layers across all models and apply different perturbations to the hidden representations of different models, which effectively promotes model diversity.
Our experiments using large language models demonstrate that CAMERO significantly improves the generalization performance of the ensemble model.
arXiv Detail & Related papers (2022-04-13T19:54:51Z) - No One Representation to Rule Them All: Overlapping Features of Training
Methods [12.58238785151714]
High-performing models tend to make similar predictions regardless of training methodology.
Recent work has made very different training techniques, such as large-scale contrastive learning, yield competitively high accuracy.
We show that these models nonetheless specialize in how they generalize, leading to higher ensemble performance.
arXiv Detail & Related papers (2021-10-20T21:29:49Z) - When Ensembling Smaller Models is More Efficient than Single Large
Models [52.38997176317532]
We show that ensembles of smaller models can outperform single large models, achieving higher accuracy while requiring fewer total FLOPs to compute.
This suggests that the output diversity gained by ensembling can often be more efficient than training larger models.
arXiv Detail & Related papers (2020-05-01T18:56:18Z)