Update Your Transformer to the Latest Release: Re-Basin of Task Vectors
- URL: http://arxiv.org/abs/2505.22697v1
- Date: Wed, 28 May 2025 13:55:12 GMT
- Title: Update Your Transformer to the Latest Release: Re-Basin of Task Vectors
- Authors: Filippo Rinaldi, Giacomo Capitani, Lorenzo Bonicelli, Donato Crisostomi, Federico Bolelli, Elisa Ficarra, Emanuele Rodolà, Simone Calderara, Angelo Porrello
- Abstract summary: Foundation models serve as the backbone for numerous specialized models developed through fine-tuning. When the underlying pretrained model is updated or retrained, the fine-tuned model becomes obsolete. This raises the question: is it possible to transfer fine-tuning to a new release of the model? In this work, we investigate how to transfer fine-tuning to a new checkpoint without having to re-train, in a data-free manner.
- Score: 27.63078324151366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation models serve as the backbone for numerous specialized models developed through fine-tuning. However, when the underlying pretrained model is updated or retrained (e.g., on larger and more curated datasets), the fine-tuned model becomes obsolete, losing its utility and requiring retraining. This raises the question: is it possible to transfer fine-tuning to a new release of the model? In this work, we investigate how to transfer fine-tuning to a new checkpoint without having to re-train, in a data-free manner. To do so, we draw principles from model re-basin and provide a recipe based on weight permutations to re-base the modifications made to the original base model, often called task vector. In particular, our approach tailors model re-basin for Transformer models, taking into account the challenges of residual connections and multi-head attention layers. Specifically, we propose a two-level method rooted in spectral theory, initially permuting the attention heads and subsequently adjusting parameters within select pairs of heads. Through extensive experiments on visual and textual tasks, we achieve the seamless transfer of fine-tuned knowledge to new pre-trained backbones without relying on a single training step or datapoint. Code is available at https://github.com/aimagelab/TransFusion.
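The recipe above, which forms a task vector on the old backbone, permutes it into the basin of the new checkpoint, and adds it to the new weights, can be illustrated with the minimal PyTorch-style sketch below. This is a hedged illustration and not the released TransFusion code: the helper `permute_attention_heads`, the per-layer `head_perms` mapping, and the assumption that the permutations are already given (rather than searched for) are placeholders standing in for the paper's two-level spectral procedure.

```python
# Illustrative sketch only (not the authors' implementation) of re-basing a
# task vector onto a new pretrained checkpoint via attention-head permutations.
import torch

def task_vector(base_old, finetuned_old):
    """The delta that fine-tuning applied on top of the old backbone."""
    return {k: finetuned_old[k] - base_old[k] for k in base_old}

def permute_attention_heads(weight, head_perm, num_heads):
    """Re-order attention heads by permuting per-head blocks of output rows."""
    head_dim = weight.shape[0] // num_heads
    blocks = weight.reshape(num_heads, head_dim, *weight.shape[1:])
    return blocks[head_perm].reshape(weight.shape)

def rebase_and_apply(base_old, base_new, finetuned_old, head_perms, num_heads):
    """Data-free transfer: permute the task vector, then add it to the new backbone.

    head_perms maps parameter names (e.g. per-layer q/k/v projections) to the
    head orderings found by the matching step, which is abstracted away here.
    """
    tv = task_vector(base_old, finetuned_old)
    for name, perm in head_perms.items():
        tv[name] = permute_attention_heads(tv[name], perm, num_heads)
    return {k: base_new[k] + tv.get(k, torch.zeros_like(base_new[k]))
            for k in base_new}
```

In the paper, the permutations themselves are found without any data via the two-level method described above, first matching whole attention heads and then aligning parameters within selected pairs of heads; the sketch only shows how a found permutation would be applied.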
Related papers
- MERGETUNE: Continued fine-tuning of vision-language models [77.8627788911249]
Fine-tuning vision-language models (VLMs) often leads to catastrophic forgetting of pretrained knowledge. We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted.
arXiv Detail & Related papers (2026-01-15T15:15:53Z) - Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models [25.83401080149413]
We show that the key to successful transfer lies in the sign structure of the gradients of the new model. We propose GradFix, a novel method that approximates the ideal gradient sign structure. We demonstrate significant performance gains on vision and language benchmarks.
arXiv Detail & Related papers (2025-10-07T13:30:25Z) - Weight subcloning: direct initialization of transformers using larger pretrained ones [42.056148990349094]
We introduce a technique to transfer the knowledge of a pretrained model to smaller variants.
Weight subcloning expedites the training of scaled-down transformers by initializing their weights from larger pretrained models.
We achieve 4x faster training for vision transformers in image classification and language models designed for next token prediction.
arXiv Detail & Related papers (2023-12-14T19:08:56Z) - Initializing Models with Larger Ones [76.41561758293055]
We introduce weight selection, a method for initializing smaller models by selecting a subset of weights from a pretrained larger model.
Our experiments demonstrate that weight selection can significantly enhance the performance of small models and reduce their training time.
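As a hedged illustration of the weight-selection idea in this entry, the sketch below initializes a smaller model by copying the leading slice of every tensor it shares with a larger pretrained state dict; the selection rule used in the paper may differ, and the function name is hypothetical.

```python
# Illustrative sketch only: initialize a small model from a subset of the
# weights of a larger pretrained one (leading slice along every dimension).
import torch.nn as nn

def weight_selection_init(small_model: nn.Module, large_state_dict: dict) -> nn.Module:
    small_sd = small_model.state_dict()
    for name, w_small in small_sd.items():
        w_large = large_state_dict.get(name)
        if w_large is None or w_large.dim() != w_small.dim():
            continue  # layer not shared with the large model; keep random init
        # Take the first-k elements of the larger tensor along every dimension.
        slices = tuple(slice(0, s) for s in w_small.shape)
        small_sd[name] = w_large[slices].clone()
    small_model.load_state_dict(small_sd)
    return small_model
```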
arXiv Detail & Related papers (2023-11-30T18:58:26Z) - Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model [74.62272538148245]
We show that for arbitrary pairings of pretrained models, one model extracts significant data context unavailable in the other.
We investigate if it is possible to transfer such "complementary" knowledge from one model to another without performance degradation.
arXiv Detail & Related papers (2023-10-26T17:59:46Z) - MixBCT: Towards Self-Adapting Backward-Compatible Training [66.52766344751635]
We propose MixBCT, a simple yet highly effective backward-compatible training method.
We conduct experiments on the large-scale face recognition datasets MS1Mv3 and IJB-C.
arXiv Detail & Related papers (2023-08-14T05:55:38Z) - Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches, which, to our knowledge, is the first time a transformer-based model has done so.
arXiv Detail & Related papers (2023-05-26T00:43:02Z) - $\Delta$-Patching: A Framework for Rapid Adaptation of Pre-trained Convolutional Networks without Base Performance Loss [71.46601663956521]
Models pre-trained on large-scale datasets are often fine-tuned to support newer tasks and datasets that arrive over time.
We propose $\Delta$-Patching for fine-tuning neural network models in an efficient manner, without the need to store model copies.
Our experiments show that $\Delta$-Networks outperform earlier model patching work while only requiring a fraction of parameters to be trained.
arXiv Detail & Related papers (2023-03-26T16:39:44Z) - Re-parameterizing Your Optimizers rather than Architectures [119.08740698936633]
We propose a novel paradigm of incorporating model-specific prior knowledge into the optimizers and using them to train generic (simple) models.
As an implementation, we propose a novel methodology to add prior knowledge by modifying the gradients according to a set of model-specific hyper-parameters.
We focus on a VGG-style plain model and show that such a simple model, trained with the proposed re-parameterized optimizer and referred to as RepOpt-VGG, performs on par with recent well-designed models.
arXiv Detail & Related papers (2022-05-30T16:55:59Z) - Forward Compatible Training for Representation Learning [53.300192863727226]
Backward compatible training (BCT) modifies training of the new model to make its representations compatible with those of the old model.
BCT can significantly hinder the performance of the new model.
In this work, we propose a new learning paradigm for representation learning: forward compatible training (FCT).
arXiv Detail & Related papers (2021-12-06T06:18:54Z) - Bridging Pre-trained Models and Downstream Tasks for Source Code Understanding [13.65914588243695]
We propose an approach to bridge pre-trained models and code-related tasks.
We exploit semantic-preserving transformation to enrich downstream data diversity.
We introduce curriculum learning to organize the transformed data in an easy-to-hard manner to fine-tune existing pre-trained models.
arXiv Detail & Related papers (2021-12-04T07:21:28Z) - Deep Ensembles for Low-Data Transfer Learning [21.578470914935938]
We study different ways of creating ensembles from pre-trained models.
We show that the nature of pre-training itself is a performant source of diversity.
We propose a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset.
arXiv Detail & Related papers (2020-10-14T07:59:00Z) - Neural Network Retraining for Model Serving [32.857847595096025]
We propose incremental (re)training of a neural network model to cope with a continuous flow of new data in inference.
We address two challenges of life-long retraining: catastrophic forgetting and efficient retraining.
arXiv Detail & Related papers (2020-04-29T13:52:28Z) - Renofeation: A Simple Transfer Learning Method for Improved Adversarial Robustness [26.73248223512572]
A recent adversarial attack can successfully deceive models trained with transfer learning via end-to-end fine-tuning.
This raises security concerns for many industrial applications.
We propose noisy feature distillation, a new transfer learning method.
arXiv Detail & Related papers (2020-02-07T20:07:22Z)