Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models
- URL: http://arxiv.org/abs/2510.09658v2
- Date: Thu, 16 Oct 2025 16:13:33 GMT
- Title: Gradient-Sign Masking for Task Vector Transport Across Pre-Trained Models
- Authors: Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Fengyuan Liu, Marco Ciccone, Angelo Porrello, Simone Calderara
- Abstract summary: We show that the key to successful transfer lies in the sign structure of the gradients of the new model. We propose GradFix, a novel method that approximates the ideal gradient sign structure. We demonstrate significant performance gains on vision and language benchmarks.
- Score: 25.83401080149413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When a new release of a foundation model is published, practitioners typically need to repeat full fine-tuning, even if the same task has already been solved in the previous version. A promising alternative is to reuse the parameter changes (i.e., task vectors) that capture how a model adapts to a specific task. However, they often fail to transfer across different pre-trained models due to their misaligned parameter space. In this work, we show that the key to successful transfer lies in the sign structure of the gradients of the new model. Based on this insight, we propose GradFix, a novel method that approximates the ideal gradient sign structure and leverages it to transfer knowledge using only a handful of labeled samples. Notably, this requires no additional fine-tuning: the adaptation is achieved by computing a few gradients at the target model and masking the source task vector accordingly. This yields an update that is locally aligned with the target loss landscape, effectively rebasing the task vector onto the new pre-training. We provide a theoretical guarantee that our method ensures first-order descent. Empirically, we demonstrate significant performance gains on vision and language benchmarks, consistently outperforming naive task vector addition and few-shot fine-tuning.
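The abstract outlines the core operation: compute a few gradients of the target (newly released) model on labeled samples, then mask the source task vector so that only components aligned with the target's descent direction survive. The sketch below is a minimal PyTorch interpretation of that idea based solely on the abstract, not the authors' reference implementation; the specific masking rule (keep components whose sign matches the negative gradient) and all function names are assumptions.

```python
import torch

def gradient_sign_mask(target_model, source_task_vector, data_loader, loss_fn, device="cpu"):
    """Mask a source task vector using gradient signs of the target model.

    Interpretation of the abstract: task-vector components whose sign agrees with
    the target model's descent direction (the negative gradient, estimated from a
    handful of labeled samples) are kept; the rest are zeroed. Not the official code.
    """
    target_model.to(device).train()
    target_model.zero_grad()

    # Accumulate gradients over a few labeled batches (few-shot estimate).
    for inputs, labels in data_loader:
        loss = loss_fn(target_model(inputs.to(device)), labels.to(device))
        loss.backward()

    masked_vector = {}
    for name, param in target_model.named_parameters():
        if param.grad is None or name not in source_task_vector:
            continue
        tau = source_task_vector[name].to(device)
        # Keep only components pointing along the descent direction (-grad) of the target loss,
        # so the masked update has a non-positive inner product with the gradient.
        mask = (torch.sign(tau) == torch.sign(-param.grad)).to(tau.dtype)
        masked_vector[name] = tau * mask
    return masked_vector

# Usage sketch: rebase the source task vector onto the new pre-trained model.
# masked = gradient_sign_mask(new_model, task_vector, few_shot_loader, torch.nn.CrossEntropyLoss())
# with torch.no_grad():
#     for name, param in new_model.named_parameters():
#         if name in masked:
#             param.add_(masked[name])
```

Under this reading, every retained component satisfies tau_i * grad_i <= 0, which is consistent with the first-order descent guarantee mentioned in the abstract.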
Related papers
- Purifying Task Vectors in Knowledge-Aware Subspace for Model Merging [83.5273168208788]
Model merging aims to integrate task-specific abilities from individually fine-tuned models into a single model without extra training. The merged model often suffers from notable performance degradation due to the conflicts caused by task-irrelevant redundancy in task vectors. We propose Purifying TAsk Vectors (PAVE) in knowledge-aware subspace to overcome these challenges.
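For context, the baseline that purification methods such as PAVE refine is plain task-vector merging: add a scaled sum of task vectors back onto the base model. The sketch below shows only that standard baseline; the knowledge-aware subspace purification itself is not reproduced, and the scaling coefficient is an assumed hyperparameter.

```python
import torch

def merge_task_vectors(base_state, finetuned_states, alpha=0.3):
    """Naive task-arithmetic merging: theta_merged = theta_base + alpha * sum_k (theta_k - theta_base).

    Standard baseline that purification approaches improve on; alpha is an
    assumed scaling hyperparameter.
    """
    merged = {}
    for name, base_param in base_state.items():
        task_sum = sum(ft[name] - base_param for ft in finetuned_states)
        merged[name] = base_param + alpha * task_sum
    return merged
```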
arXiv Detail & Related papers (2025-10-16T14:02:57Z)
- Update Your Transformer to the Latest Release: Re-Basin of Task Vectors [27.63078324151366]
Foundation models serve as the backbone for numerous specialized models developed through fine-tuning. When the underlying pretrained model is updated or retrained, the fine-tuned model becomes obsolete. This raises the question: is it possible to transfer fine-tuning to a new release of the model? In this work, we investigate how to transfer fine-tuning to a new checkpoint without having to re-train, in a data-free manner.
arXiv Detail & Related papers (2025-05-28T13:55:12Z)
- Cross-Model Transfer of Task Vectors via Few-Shot Orthogonal Alignment [5.2980803808373516]
Task arithmetic enables efficient model editing by representing task-specific changes as vectors in parameter space, but it assumes the source and target models share the same pre-trained initialization. This assumption limits its applicability in cross-model transfer settings, where models are independently pre-trained on different datasets. We propose a method based on few-shot alignment, which aligns task vectors to the parameter space of a differently pre-trained target model.
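The paper learns its alignment from a few labeled samples; as a rough illustration of what an orthogonal alignment between two models' parameter spaces looks like, the sketch below uses the classic closed-form orthogonal Procrustes solution. The choice of matrices and the layer-wise usage are assumptions, and the paper's few-shot objective differs.

```python
import torch

def orthogonal_procrustes(source, target):
    """Closed-form orthogonal map Q minimizing ||Q @ source - target||_F.

    Illustrates an orthogonal alignment between parameter matrices; the paper's
    few-shot alignment procedure is learned and may differ from this solution.
    """
    # SVD of the cross-covariance between target and source.
    u, _, vT = torch.linalg.svd(target @ source.T)
    return u @ vT

# Hypothetical layer-wise usage: align a source layer's task-vector slice
# delta_W to the target model's basis before adding it.
# Q = orthogonal_procrustes(W_source_layer, W_target_layer)
# aligned_delta = Q @ delta_W_source_layer
```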
arXiv Detail & Related papers (2025-05-17T14:24:06Z)
- Pre-Trained Model Recommendation for Downstream Fine-tuning [22.343011779348682]
Model selection aims to rank off-the-shelf pre-trained models and select the most suitable one for the new target task.
Existing model selection techniques are often constrained in their scope and tend to overlook the nuanced relationships between models and tasks.
We present a pragmatic framework, Fennec, delving into a diverse, large-scale model repository.
arXiv Detail & Related papers (2024-03-11T02:24:32Z)
- Understanding the Transferability of Representations via Task-Relatedness [8.425690424016986]
We propose a novel analysis of the transferability of the representations of pre-trained models to downstream tasks in terms of their relatedness to a given reference task.
Our experiments using state-of-the-art pre-trained models show the effectiveness of task-relatedness in explaining transferability on various vision and language tasks.
arXiv Detail & Related papers (2023-07-03T08:06:22Z)
- $\Delta$-Patching: A Framework for Rapid Adaptation of Pre-trained Convolutional Networks without Base Performance Loss [71.46601663956521]
Models pre-trained on large-scale datasets are often fine-tuned to support newer tasks and datasets that arrive over time.
We propose $\Delta$-Patching for fine-tuning neural network models in an efficient manner, without the need to store model copies.
Our experiments show that $\Delta$-Networks outperform earlier model patching work while only requiring a fraction of parameters to be trained.
arXiv Detail & Related papers (2023-03-26T16:39:44Z)
- Voting from Nearest Tasks: Meta-Vote Pruning of Pre-trained Models for Downstream Tasks [55.431048995662714]
We create a small model for a new task from the pruned models of similar tasks.
We show that a few fine-tuning steps on this model suffice to produce a promising pruned-model for the new task.
We develop a simple but effective "Meta-Vote Pruning (MVP)" method that significantly reduces the pruning iterations for a new task.
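The summary suggests the pruning mask for a new task is obtained by voting among the masks of pruned models from its nearest tasks. The sketch below shows a plain majority vote; the exact voting and task-similarity rules used by MVP are assumptions here.

```python
import torch

def meta_vote_mask(neighbor_masks, threshold=0.5):
    """Majority-vote a binary pruning mask from the masks of nearest tasks.

    neighbor_masks: list of {param_name: 0/1 tensor} taken from pruned models
    of similar tasks. A weight is kept if at least `threshold` of the neighbors
    keep it. The actual MVP voting rule may differ; this is an illustration.
    """
    voted = {}
    for name in neighbor_masks[0]:
        votes = torch.stack([m[name].float() for m in neighbor_masks]).mean(dim=0)
        voted[name] = (votes >= threshold).float()
    return voted

# The voted mask initializes a pruned model for the new task, which is then
# refined with a few fine-tuning steps (per the summary above).
```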
arXiv Detail & Related papers (2023-01-27T06:49:47Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
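A minimal reading of self-distillation as a regularizer for further pre-training is to freeze a snapshot of the current checkpoint as teacher and add a KL term between teacher and student predictions to the continued pre-training loss. The weighting coefficient and setup below are assumptions, not the paper's exact recipe.

```python
import copy
import torch
import torch.nn.functional as F

def self_distillation_loss(student, teacher, inputs, base_loss, beta=1.0):
    """Further pre-training loss plus a KL regularizer toward a frozen model copy.

    `teacher` is a frozen snapshot of the student taken before further
    pre-training; `beta` is an assumed weighting coefficient.
    """
    student_logits = student(inputs)
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return base_loss + beta * kl

# Setup sketch: teacher = copy.deepcopy(model).eval(); freeze its parameters,
# then continue pre-training `model` with this combined objective.
```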
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- Parameter-Efficient Transfer Learning with Diff Pruning [108.03864629388404]
Diff pruning is a simple approach to enable parameter-efficient transfer learning within the pretrain-finetune framework.
We find that models finetuned with diff pruning can match the performance of fully finetuned baselines on the GLUE benchmark.
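Diff pruning parameterizes the task-specific model as the frozen pretrained weights plus a sparse diff vector, learning the sparsity with a relaxed L0 penalty during fine-tuning. The sketch below keeps that parameterization but swaps in simple post-hoc magnitude sparsification as a stand-in for the learned L0 relaxation; the keep ratio is an assumed hyperparameter.

```python
import torch

def sparsify_diff(base_state, finetuned_state, keep_ratio=0.005):
    """Keep only the largest-magnitude entries of the diff (theta_ft - theta_base).

    Diff pruning proper learns sparsity with a relaxed L0 penalty during
    fine-tuning; magnitude-based post-hoc pruning is a simplified stand-in.
    """
    diffs = {name: finetuned_state[name] - base_state[name] for name in base_state}
    flat = torch.cat([d.flatten().abs() for d in diffs.values()])
    k = max(1, int(keep_ratio * flat.numel()))
    threshold = torch.topk(flat, k).values.min()

    sparse = {}
    for name, d in diffs.items():
        sparse[name] = torch.where(d.abs() >= threshold, d, torch.zeros_like(d))
    return sparse

# Applying the task-specific patch: theta_task[name] = theta_base[name] + sparse[name].
```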
arXiv Detail & Related papers (2020-12-14T12:34:01Z)
- Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
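A generic way to realize full-text plausibility ranking is to score each candidate completion by its log-likelihood under a language model and pick the highest-scoring one. The sketch below uses a causal LM from Hugging Face transformers as an illustration; the backbone, length normalization, and scoring formula are assumptions and may differ from the paper's method.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def plausibility_scores(texts, model_name="gpt2"):
    """Score full-text hypotheses by average token log-likelihood under a causal LM.

    Generic illustration of full-text plausibility ranking; the paper's scoring
    formula and backbone may differ.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    scores = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            # With labels=input_ids, the model returns the mean cross-entropy over tokens.
            loss = model(ids, labels=ids).loss
        scores.append(-loss.item())  # higher = more plausible
    return scores

# Ranking sketch: pick the candidate full-text hypothesis with the highest score.
# scores = plausibility_scores(candidates)
# best = candidates[scores.index(max(scores))]
```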
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.