LARV: Data-Free Layer-wise Adaptive Rescaling Veneer for Model Merging
- URL: http://arxiv.org/abs/2602.09413v1
- Date: Tue, 10 Feb 2026 05:10:31 GMT
- Title: LARV: Data-Free Layer-wise Adaptive Rescaling Veneer for Model Merging
- Authors: Xinyu Wang, Ke Deng, Fei Dou, Jinbo Bi, Jin Lu
- Abstract summary: We introduce LARV, a training-free, data-free, merger-agnostic Layer-wise Adaptive Rescaling Veneer. LARV adaptively suppresses shallow-layer interference and amplifies deeper-layer alignment using a simple deterministic schedule. Layer-wise analysis and corruption tests indicate that LARV suppresses shallow-layer interference while modestly amplifying deeper, task-stable features.
- Score: 11.135582038431368
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model merging aims to combine multiple fine-tuned models into a single multi-task model without access to training data. Existing task-vector merging methods such as TIES, TSV-M, and Iso-C/CTS differ in their aggregation rules but treat all layers nearly uniformly. This assumption overlooks the strong layer-wise heterogeneity in large vision transformers, where shallow layers are sensitive to interference while deeper layers encode stable task-specific features. We introduce LARV, a training-free, data-free, merger-agnostic Layer-wise Adaptive Rescaling Veneer that plugs into any task-vector merger and assigns a per-layer scale to each task vector before aggregation; we show that it consistently boosts diverse merging rules. LARV adaptively suppresses shallow-layer interference and amplifies deeper-layer alignment using a simple deterministic schedule, requiring no retraining or modification to existing mergers. To our knowledge, this is the first work to perform layer-aware scaling for task-vector merging. LARV computes simple data-free layer proxies and turns them into scales through a lightweight rule; we study several instantiations within one framework (e.g., tiered two- or three-level scaling with fixed values, or continuous mappings) and show that tiered choices offer the best robustness, while continuous mappings serve as an ablation. LARV is orthogonal to the base merger and adds negligible cost. On FusionBench with Vision Transformers, LARV consistently improves all task-vector baselines across 8-, 14-, and 20-task settings; for example, Iso-C + LARV reaches 85.9% on ViT-B/32, 89.2% on ViT-B/16, and 92.6% on ViT-L/14. Layer-wise analysis and corruption tests further indicate that LARV suppresses shallow-layer interference while modestly amplifying deeper, task-stable features, turning model merging into a robust, layer-aware procedure rather than a uniform one.
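The abstract does not spell out the exact proxy or tier thresholds. The following is a minimal sketch of the general recipe under stated assumptions: task vectors are represented as per-layer dicts of weight deltas, the data-free proxy is taken to be the mean pairwise cosine similarity of task vectors at each layer, and the three tier values (0.6, 1.0, 1.2) are illustrative placeholders, not the paper's schedule.

```python
import torch
import torch.nn.functional as F

def layer_proxy(task_vectors, layer):
    # Data-free proxy: mean pairwise cosine similarity of the task
    # vectors at this layer. An illustrative choice; the paper studies
    # several proxies within one framework.
    vs = [tv[layer].flatten() for tv in task_vectors]
    sims = [F.cosine_similarity(a, b, dim=0)
            for i, a in enumerate(vs) for b in vs[i + 1:]]
    return torch.stack(sims).mean()

def larv_scales(task_vectors, layers, tiers=(0.6, 1.0, 1.2)):
    # Tiered three-level schedule: bucket layers by proxy rank and
    # assign a fixed scale per bucket. Tier values are assumptions.
    proxies = torch.stack([layer_proxy(task_vectors, l) for l in layers])
    lo, hi = proxies.quantile(1 / 3), proxies.quantile(2 / 3)
    return {l: tiers[0] if p < lo else tiers[2] if p > hi else tiers[1]
            for l, p in zip(layers, proxies)}

def apply_larv(task_vectors, scales):
    # Rescale every task vector per layer, then hand the result to any
    # base merger (TIES, TSV-M, Iso-C, ...) unchanged.
    return [{l: s * tv[l] for l, s in scales.items()} for tv in task_vectors]
```

Because the veneer only rescales the merger's inputs, it composes with any aggregation rule at negligible extra cost.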
Related papers
- Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Models [51.754991950934375]
In a pretrained VLM, all layers are engaged by default to make predictions on downstream tasks. We find that intervening on a single layer, such as by zeroing its parameters, can improve the performance on certain tasks. We propose TaLo, a training-free, test-time adaptation method that dynamically identifies and bypasses the most interfering layer for a given task.
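As a rough illustration of the bypass idea, one can skip each block at test time and keep the skip that most improves an unsupervised score. The entropy criterion, the logits-returning `model(batch)` call, and the `blocks` list (e.g., a ViT's transformer blocks) below are assumptions, not TaLo's published procedure.

```python
import torch

def mean_entropy(logits):
    # Unsupervised test-time score (illustrative; TaLo's exact
    # objective may differ).
    p = logits.softmax(dim=-1)
    return -(p * p.clamp_min(1e-8).log()).sum(dim=-1).mean()

@torch.no_grad()
def most_interfering_block(model, blocks, batch):
    # Bypass each transformer block in turn (identity pass-through)
    # and keep the bypass that most lowers the score.
    base = mean_entropy(model(batch))
    best_idx, best_gain = None, 0.0
    for i, blk in enumerate(blocks):
        original = blk.forward
        blk.forward = lambda x, *args, **kwargs: x  # skip this block
        gain = (base - mean_entropy(model(batch))).item()
        blk.forward = original                      # restore
        if gain > best_gain:
            best_idx, best_gain = i, gain
    return best_idx  # None means no single bypass helped
```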
arXiv Detail & Related papers (2026-02-01T11:37:05Z)
- CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks. In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks. To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z)
- Hierarchical Adaptive Networks with Task Vectors for Test-Time Adaptation [3.3834108313265916]
We propose Hierarchical Adaptive Networks with Task Vectors (Hi-Vec). Hi-Vec allows existing methods to adapt to shifts of varying complexity. We rigorously evaluate the performance of Hi-Vec in challenging scenarios and on multiple target datasets.
arXiv Detail & Related papers (2025-08-11T21:55:53Z)
- MASS: MoErging through Adaptive Subspace Selection [55.03293736484465]
We present MASS (MoErging through Adaptive Subspace Selection), a new approach to model merging. MASS stores only the most salient singular components for each task and merges them into a shared model. We evaluate MASS on CLIP-based image classification using ViT-B-16, ViT-B-32 and ViT-L-14 for benchmarks of 8, 14 and 20 tasks respectively.
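A minimal sketch of the singular-component idea, assuming a fixed rank `k` and a simple scaled sum as the merge rule; MASS's actual subspace selection is adaptive and its merge rule may differ.

```python
import torch

def salient_components(delta, k=8):
    # Keep only the top-k singular components of a 2-D task delta
    # (fixed k is a simplification of adaptive subspace selection).
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    return U[:, :k] @ torch.diag(S[:k]) @ Vh[:k]

def mass_merge(base_weight, task_deltas, k=8, alpha=0.3):
    # Sum the low-rank approximations into the shared weight;
    # alpha is an assumed merging coefficient.
    return base_weight + alpha * sum(salient_components(d, k) for d in task_deltas)
```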
arXiv Detail & Related papers (2025-04-06T08:49:52Z)
- LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging [80.17238673443127]
LiNeS is a post-training editing technique designed to preserve pre-trained generalization while enhancing fine-tuned task performance. LiNeS demonstrates significant improvements in both single-task and multi-task settings across various benchmarks in vision and natural language processing.
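LiNeS scales a model's residual updates by layer depth. The sketch below assumes a linear schedule from `gamma_min` at the first block to `gamma_max` at the last, operating on state dicts; the endpoint values and the `layer_of` helper (mapping a parameter name to its block index) are illustrative assumptions.

```python
def lines_coeff(layer_idx, num_layers, gamma_min=0.1, gamma_max=1.0):
    # Linear depth schedule: shallow layers stay close to the
    # pre-trained weights, deep layers keep the fine-tuned update.
    t = layer_idx / max(num_layers - 1, 1)
    return gamma_min + (gamma_max - gamma_min) * t

def lines_edit(pretrained, finetuned, layer_of, num_layers):
    # Rescale each parameter's update by its layer's coefficient.
    return {name: w0 + lines_coeff(layer_of(name), num_layers) * (finetuned[name] - w0)
            for name, w0 in pretrained.items()}
```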
arXiv Detail & Related papers (2024-10-22T16:26:05Z)
- FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction [16.84400858871298]
We propose FiRST, an algorithm that reduces latency by using layer-specific routers to select a subset of transformer layers adaptively for each input sequence. FiRST preserves compatibility with KV caching, enabling faster inference while being quality-aware. Our approach reveals that input adaptivity is critical: different task-specific middle layers play a crucial role in evolving hidden representations depending on the task.
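A sketch of per-layer routing at inference time, assuming a sequence-level linear router over the mean token embedding; FiRST's actual router architecture and training procedure are not specified here.

```python
import torch
from torch import nn

class RoutedBlock(nn.Module):
    # Wraps a transformer block with a tiny router that decides, per
    # input sequence, whether to run or skip the block. A sketch of
    # layer-selective routing; FiRST's design may differ.
    def __init__(self, block, d_model, threshold=0.5):
        super().__init__()
        self.block = block
        self.router = nn.Linear(d_model, 1)
        self.threshold = threshold

    def forward(self, x):  # x: (batch, seq, d_model)
        keep = torch.sigmoid(self.router(x.mean(dim=1))).squeeze(1) > self.threshold
        if not keep.any():
            return x                     # skip the block for the whole batch
        out = x.clone()
        out[keep] = self.block(x[keep])  # run only the routed sequences
        return out
```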
arXiv Detail & Related papers (2024-10-16T12:45:35Z)
- Merging Vision Transformers from Different Tasks and Domains [46.40701388197936]
This work aims to merge various Vision Transformers (ViTs) trained on different tasks (i.e., datasets with different object categories) or domains (i.e., datasets with the same categories but different environments) into one unified model.
Previous model-merging work focuses on either CNNs or NLP models, leaving ViT merging untouched.
arXiv Detail & Related papers (2023-12-25T09:32:28Z)
- Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployment: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC).
arXiv Detail & Related papers (2023-11-23T15:46:54Z)
- Hierarchical Side-Tuning for Vision Transformers [33.536948382414316]
Fine-tuning pre-trained Vision Transformers (ViTs) has showcased significant promise in enhancing visual recognition tasks.
Parameter-efficient transfer learning (PETL) has shown potential for achieving high performance with fewer parameter updates than full fine-tuning.
This paper introduces Hierarchical Side-Tuning (HST), an innovative PETL method facilitating the transfer of ViT models to diverse downstream tasks.
arXiv Detail & Related papers (2023-10-09T04:16:35Z)
- Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance across various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress pre-trained ViTs.
Our SPViT trims 52.0% of the FLOPs for DeiT-B while simultaneously gaining an impressive 0.6% in top-1 accuracy.
arXiv Detail & Related papers (2021-11-23T11:35:54Z)
- BERMo: What can BERT learn from ELMo? [6.417011237981518]
We use the linear combination scheme proposed in Embeddings from Language Models (ELMo) to combine scaled internal representations from different network depths.
Our approach has twofold benefits: (1) improved gradient flow for the downstream task and (2) increased representative power.
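The ELMo-style combination is a scalar mix: softmax-normalized per-depth weights plus one global scale, E = gamma * sum_j s_j * h_j. A minimal module implementing it:

```python
import torch
from torch import nn

class ScalarMix(nn.Module):
    # ELMo-style linear combination of internal representations:
    # softmax-normalized per-layer weights s_j and a global scale gamma.
    def __init__(self, num_layers):
        super().__init__()
        self.w = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(()))

    def forward(self, layer_states):
        # layer_states: list of (batch, seq, dim) tensors, one per depth.
        s = torch.softmax(self.w, dim=0)
        return self.gamma * sum(si * h for si, h in zip(s, layer_states))
```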
arXiv Detail & Related papers (2021-10-18T17:35:41Z)