Leveraging Parameter Space Symmetries for Reasoning Skill Transfer in LLMs
- URL: http://arxiv.org/abs/2511.10850v1
- Date: Thu, 13 Nov 2025 23:20:57 GMT
- Title: Leveraging Parameter Space Symmetries for Reasoning Skill Transfer in LLMs
- Authors: Stefan Horoi, Sangwoo Cho, Supriyo Chakraborty, Shi-Xiong Zhang, Sambit Sahu, Guy Wolf, Genta Indra Winata,
- Abstract summary: Task arithmetic is a powerful technique for transferring skills between Large Language Models (LLMs). We first align the models' parameter spaces, leveraging the inherent permutation, rotation, and scaling symmetries of Transformer architectures. We successfully transfer advanced reasoning skills to a non-reasoning model.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Task arithmetic is a powerful technique for transferring skills between Large Language Models (LLMs), but it often suffers from negative interference when models have diverged during training. We address this limitation by first aligning the models' parameter spaces, leveraging the inherent permutation, rotation, and scaling symmetries of Transformer architectures. We adapt parameter space alignment for modern Grouped-Query Attention (GQA) and SwiGLU layers, exploring both weight-based and activation-based approaches. Using this alignment-first strategy, we successfully transfer advanced reasoning skills to a non-reasoning model. Experiments on challenging reasoning benchmarks show that our method consistently outperforms standard task arithmetic. This work provides an effective approach for merging and transferring specialized skills across evolving LLM families, reducing redundant fine-tuning and enhancing model adaptability.
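The alignment-then-merge idea from the abstract can be sketched on a toy two-layer MLP. This is a minimal illustration under assumed simplifications, not the paper's actual method: `permute_mlp` and `align_then_merge` are hypothetical helpers that (1) exploit the hidden-unit permutation symmetry of an MLP, (2) greedily match donor units to base units by weight similarity, and (3) apply standard task arithmetic with a scaling coefficient `lam` on the aligned weights.

```python
import numpy as np

def permute_mlp(w_in, w_out, perm):
    """Apply a hidden-unit permutation to a 2-layer MLP.
    Permuting the rows of w_in and the columns of w_out leaves the
    function computed by the MLP unchanged (a parameter-space symmetry)."""
    return w_in[perm, :], w_out[:, perm]

def align_then_merge(base_in, base_out, donor_in, donor_out, lam=0.5):
    """Align the donor's hidden units to the base model, then merge
    via task arithmetic: base + lam * (aligned_donor - base)."""
    # Greedy alignment: match each donor hidden unit to the most
    # similar base unit by the dot product of their input weights.
    sim = donor_in @ base_in.T
    perm = np.full(donor_in.shape[0], -1)
    used = set()
    for i in np.argsort(-np.abs(sim).max(axis=1)):  # most confident rows first
        for j in np.argsort(-sim[i]):
            if j not in used:
                perm[i] = j
                used.add(j)
                break
    inv = np.argsort(perm)  # reorder donor units into the base ordering
    d_in, d_out = donor_in[inv], donor_out[:, inv]
    # Task arithmetic on the aligned parameters.
    return (base_in + lam * (d_in - base_in),
            base_out + lam * (d_out - base_out))
```

When the donor is an exact permutation of the base, the greedy match recovers the permutation and the merge reduces to the base model, which is the sanity check that alignment removes the spurious "task vector" a naive subtraction would produce.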
Related papers
- Merging Beyond: Streaming LLM Updates via Activation-Guided Rotations [55.047454145941366]
Streaming Merging is an innovative model updating paradigm that conceptualizes merging as an iterative optimization process. ARM is a strategy designed to approximate gradient descent dynamics. ARM requires only early SFT checkpoints and, through iterative merging, surpasses the fully converged SFT model.
arXiv Detail & Related papers (2026-02-03T08:15:57Z) - Orthogonal Projection Subspace to Aggregate Online Prior-knowledge for Continual Test-time Adaptation [67.80294336559574]
Continual Test Time Adaptation (CTTA) is a task that requires a source pre-trained model to continually adapt to new scenarios. We propose a novel pipeline, Orthogonal Projection Subspace to aggregate online Prior-knowledge, dubbed OoPk.
arXiv Detail & Related papers (2025-06-23T18:17:39Z) - Generalized Tensor-based Parameter-Efficient Fine-Tuning via Lie Group Transformations [50.010924231754856]
Adapting pre-trained foundation models for diverse downstream tasks is a core practice in artificial intelligence, but fully fine-tuning such models is costly. To overcome this, parameter-efficient fine-tuning (PEFT) methods like LoRA have emerged and are becoming a growing research focus. We propose a generalization that extends matrix-based PEFT methods to higher-dimensional parameter spaces without compromising their structural properties.
arXiv Detail & Related papers (2025-04-01T14:36:45Z) - Sens-Merging: Sensitivity-Guided Parameter Balancing for Merging Large Language Models [20.741460682103863]
Sens-Merging is a sensitivity-guided coefficient adjustment method for model merging. We show that Sens-Merging significantly improves performance across general knowledge, mathematical reasoning, and code generation tasks. Our findings reveal important trade-offs between task-specific and cross-task scalings, providing insights for future model merging strategies.
arXiv Detail & Related papers (2025-02-18T01:41:13Z) - How to Alleviate Catastrophic Forgetting in LLMs Finetuning? Hierarchical Layer-Wise and Element-Wise Regularization [15.434072331989878]
Large Language Models (LLMs) exhibit strong general language capabilities. Fine-tuning these models on domain-specific tasks often leads to catastrophic forgetting, where the model overwrites or loses essential knowledge acquired during pretraining. We propose a novel approach to compute the element-wise importance of model parameters crucial for preserving general knowledge during fine-tuning.
arXiv Detail & Related papers (2025-01-23T13:54:53Z) - Merging Models on the Fly Without Retraining: A Sequential Approach to Scalable Continual Model Merging [75.93960998357812]
Deep model merging represents an emerging research direction that combines multiple fine-tuned models to harness their capabilities across different tasks and domains. Current model merging techniques focus on merging all available models simultaneously, with weight-matrix-based methods being the predominant approaches. We propose a training-free projection-based continual merging method that processes models sequentially.
arXiv Detail & Related papers (2025-01-16T13:17:24Z) - Transformer-Squared: Self-adaptive LLMs [29.1326358746118]
We introduce Transformer-Squared, a novel self-adaptation framework that adapts large language models for unseen tasks in real time. Our method consistently outperforms ubiquitous approaches such as LoRA, with fewer parameters and greater efficiency. Transformer-Squared represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs.
arXiv Detail & Related papers (2025-01-09T01:19:21Z) - Sketch to Adapt: Fine-Tunable Sketches for Efficient LLM Adaptation [33.05581803204543]
Adapting pre-trained large language models (LLMs) is crucial but challenging due to their enormous size. We introduce SketchTune, a compressive adaptation strategy that compresses weights into compact fine-tunable sketches. SketchTune is supported by mathematical insights into matrix classes that are better approximated using sketching rather than low-rank methods.
arXiv Detail & Related papers (2024-10-08T20:58:24Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities. In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for adapting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z)
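Several of the merging papers above operate on task vectors sequentially rather than all at once. The projection-based continual idea in "Merging Models on the Fly Without Retraining" can be sketched roughly as follows; this is a hedged illustration under assumed simplifications (models as flat weight vectors, a Gram-Schmidt projection, a hypothetical `sequential_merge` helper), not that paper's actual algorithm.

```python
import numpy as np

def sequential_merge(base, models, lam=1.0):
    """Merge fine-tuned models one at a time without retraining.
    Each incoming task vector (model - base) is projected onto the
    orthogonal complement of previously merged directions, so later
    updates interfere less with earlier ones."""
    merged = base.astype(float).copy()
    basis = []  # orthonormal directions already absorbed
    for m in models:
        tv = (m - base).ravel().astype(float)
        for q in basis:  # Gram-Schmidt: remove overlap with past merges
            tv -= (q @ tv) * q
        norm = np.linalg.norm(tv)
        if norm > 1e-12:
            basis.append(tv / norm)
        merged += lam * tv.reshape(base.shape)
    return merged
```

Because only the orthogonal component of each new task vector is added, a second model that partly duplicates the first contributes only its novel directions, which is the interference-reduction intuition these continual-merging methods share.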
This list is automatically generated from the titles and abstracts of the papers in this site.