Model soups: averaging weights of multiple fine-tuned models improves
accuracy without increasing inference time
- URL: http://arxiv.org/abs/2203.05482v1
- Date: Thu, 10 Mar 2022 17:03:49 GMT
- Title: Model soups: averaging weights of multiple fine-tuned models improves
accuracy without increasing inference time
- Authors: Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca
Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali
Farhadi, Yair Carmon, Simon Kornblith, Ludwig Schmidt
- Abstract summary: We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations improves accuracy and robustness.
We show that the model soup approach extends to multiple image classification and natural language processing tasks.
- Score: 69.7693300927423
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The conventional recipe for maximizing model accuracy is to (1) train
multiple models with various hyperparameters and (2) pick the individual model
which performs best on a held-out validation set, discarding the remainder. In
this paper, we revisit the second step of this procedure in the context of
fine-tuning large pre-trained models, where fine-tuned models often appear to
lie in a single low error basin. We show that averaging the weights of multiple
models fine-tuned with different hyperparameter configurations often improves
accuracy and robustness. Unlike a conventional ensemble, we may average many
models without incurring any additional inference or memory costs -- we call
the results "model soups." When fine-tuning large pre-trained models such as
CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides
significant improvements over the best model in a hyperparameter sweep on
ImageNet. As a highlight, the resulting ViT-G model attains 90.94% top-1
accuracy on ImageNet, a new state of the art. Furthermore, we show that the
model soup approach extends to multiple image classification and natural
language processing tasks, improves out-of-distribution performance, and
improves zero-shot performance on new downstream tasks. Finally, we
analytically relate the performance similarity of weight-averaging and
logit-ensembling to flatness of the loss and confidence of the predictions, and
validate this relation empirically.
Related papers
- Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation [36.45488536471859]
Similarity refinement uses unlabeled images to refine the image-image similarity.
Weight calibration introduces a precision matrix into the weight function to adequately model the relation between training samples.
To reduce the high complexity of GPs, a group-based learning strategy is proposed.
arXiv Detail & Related papers (2024-10-11T15:12:30Z) - Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think [53.2706196341054]
We show that the perceived inefficiency was caused by a flaw in the inference pipeline that has so far gone unnoticed.
We perform end-to-end fine-tuning on top of the single-step model with task-specific losses and get a deterministic model that outperforms all other diffusion-based depth and normal estimation models.
arXiv Detail & Related papers (2024-09-17T16:58:52Z) - EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) achieves outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free, requiring no data or additional training, while showing impressive performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z) - Model Stock: All we need is just a few fine-tuned models [34.449901046895185]
This paper introduces an efficient fine-tuning method for large pre-trained models, offering strong in-distribution (ID) and out-of-distribution (OOD) performance.
We employ significantly fewer models to obtain the final weights, yet yield superior accuracy.
We demonstrate the efficacy of Model Stock with fine-tuned models based upon pre-trained CLIP architectures.
arXiv Detail & Related papers (2024-03-28T15:57:20Z) - Do the Frankenstein, or how to achieve better out-of-distribution
performance with manifold mixing model soup [1.0878040851637998]
We show that the fused model gives significantly better out-of-distribution performance when fine-tuning a CLIP model for image classification.
It also provides better accuracy on the original dataset on which the fine-tuning was done.
arXiv Detail & Related papers (2023-08-28T06:13:32Z) - Precision-Recall Divergence Optimization for Generative Modeling with
GANs and Normalizing Flows [54.050498411883495]
We develop a novel training method for generative models, such as Generative Adversarial Networks and Normalizing Flows.
We show that achieving a specified precision-recall trade-off corresponds to minimizing a unique $f$-divergence from a family we call the PR-divergences.
Our approach improves the performance of existing state-of-the-art models like BigGAN in terms of either precision or recall when tested on datasets such as ImageNet.
arXiv Detail & Related papers (2023-05-30T10:07:17Z) - CAMERO: Consistency Regularized Ensemble of Perturbed Language Models
with Weight Sharing [83.63107444454938]
We propose a consistency-regularized ensemble learning approach based on perturbed models, named CAMERO.
Specifically, we share the weights of the bottom layers across all models and apply different perturbations to the hidden representations of different models, which effectively promotes model diversity.
Our experiments using large language models demonstrate that CAMERO significantly improves the generalization performance of the ensemble model.
arXiv Detail & Related papers (2022-04-13T19:54:51Z) - No One Representation to Rule Them All: Overlapping Features of Training
Methods [12.58238785151714]
High-performing models tend to make similar predictions regardless of training methodology.
Recent work has made very different training techniques, such as large-scale contrastive learning, yield competitively high accuracy.
We show that these models nonetheless specialize in how they generalize, leading to higher ensemble performance.
arXiv Detail & Related papers (2021-10-20T21:29:49Z) - When Ensembling Smaller Models is More Efficient than Single Large
Models [52.38997176317532]
We show that ensembles of smaller models can outperform single large models, achieving higher accuracy while requiring fewer total FLOPs to compute.
This suggests that the output diversity gained by ensembling can often be more efficient than training larger models.
arXiv Detail & Related papers (2020-05-01T18:56:18Z)