PHLoRA: data-free Post-hoc Low-Rank Adapter extraction from full-rank checkpoint
- URL: http://arxiv.org/abs/2509.10971v1
- Date: Sat, 13 Sep 2025 20:13:58 GMT
- Title: PHLoRA: data-free Post-hoc Low-Rank Adapter extraction from full-rank checkpoint
- Authors: Bhoomit Vasani, Jack FitzGerald, Anjie Fang, Sushmit Vaish,
- Abstract summary: We introduce PHLoRA, a simple yet powerful method to extract low-rank adaptation adapters from full-rank fine-tuned models. Unlike prior work that trains each adapter explicitly, our approach decouples fine-tuning from adapter generation. Experiments on text, image, and video benchmarks using the Amazon Nova model family demonstrate that extracted adapters preserve high energy from the full weight delta, can be pruned safely, and yield negligible degradation in downstream task performance when re-merged.
- Score: 3.5840378192062956
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce PHLoRA (Post-hoc LoRA, pronounced "flora"), a simple yet powerful method to extract low-rank adaptation adapters from full-rank fine-tuned models without requiring access to training data or gradients. By computing the low-rank decomposition of weight differences between a base model and its fine-tuned counterpart, our method reconstructs adapter modules that can be merged or dynamically routed at inference time via S-LoRA, or served in scalable, industry settings using platforms like NVIDIA NIM. This approach amortizes latency overhead across requests and yields substantial cost savings. Unlike prior work that trains each adapter explicitly, our approach decouples fine-tuning from adapter generation, allowing adapter extraction from existing full-rank models or third-party checkpoints. Experiments on text, image, and video benchmarks using the Amazon Nova model family demonstrate that extracted adapters preserve high energy from the full weight delta, can be pruned safely, and yield negligible degradation in downstream task performance when re-merged. Overall, PHLoRA provides a practical path for making all existing full-rank checkpoints adapter-ready, democratizing scalable inference for all models.
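The core idea in the abstract, decomposing the weight delta between a base and a fine-tuned checkpoint with a truncated SVD, can be sketched as follows. This is a minimal illustration of the general technique, not the paper's actual implementation; the function name, the factor convention, and the energy metric are assumptions.

```python
import numpy as np

def extract_lora(W_base, W_ft, rank):
    """Sketch of post-hoc LoRA extraction from a weight delta.

    Decomposes delta = W_ft - W_base with SVD, truncated to `rank`,
    so that B @ A approximates delta. Names are illustrative.
    """
    delta = W_ft - W_base
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    # Fold the singular values into the left factor (one common convention).
    B = U[:, :rank] * S[:rank]   # shape (out_dim, rank)
    A = Vt[:rank, :]             # shape (rank, in_dim)
    # Fraction of the delta's Frobenius-norm energy captured by the top-r
    # components; a high value suggests the adapter can be re-merged or
    # pruned with little loss.
    energy = (S[:rank] ** 2).sum() / (S ** 2).sum()
    return A, B, energy

# Re-merging then reconstructs an approximation of the fine-tuned weights:
#   W_merged = W_base + B @ A
```

If the true fine-tuning delta is itself low-rank, the extraction is essentially lossless and the captured energy approaches 1.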
Related papers
- PVeRA: Probabilistic Vector-Based Random Matrix Adaptation [6.460933909139705]
We propose PVeRA, a probabilistic version of the VeRA adapter, which modifies the low-rank matrices of VeRA in a probabilistic manner. A comprehensive evaluation was performed on the VTAB-1k benchmark and seven adapters, with PVeRA outperforming VeRA and other adapters.
arXiv Detail & Related papers (2025-12-08T16:50:21Z) - Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems [11.584593298674688]
Low-Rank Adaptation (LoRA) has become the de facto method for parameter-efficient fine-tuning of large language models (LLMs). In production, LoRA-based models are served at scale, creating multi-tenant environments with hundreds of adapters sharing a base model. We present LoRAServe, a workload-aware dynamic adapter placement and routing framework designed to tame rank diversity in LoRA serving.
arXiv Detail & Related papers (2025-11-28T05:04:02Z) - A Data-driven ML Approach for Maximizing Performance in LLM-Adapter Serving [2.6336040306318274]
This study focuses on determining the joint configuration of concurrent and parallel adapters that maximizes GPU throughput without inducing starvation. We propose a data-driven ML approach leveraging interpretable models to tackle this caching problem. Experiments with the vLLM framework and LoRA adapters show that the Digital Twin reproduces throughput within 5.1% of real results.
arXiv Detail & Related papers (2025-08-11T10:47:35Z) - Exploring Sparse Adapters for Scalable Merging of Parameter Efficient Experts [72.22148263683037]
We study the properties of sparse adapters, which train only a subset of weights in the base neural network, as potential building blocks of modular architectures. First, we propose a simple method for training highly effective sparse adapters, which is conceptually simpler than existing methods in the literature. Next, we investigate the merging properties of these sparse adapters by merging adapters for up to 20 natural language processing tasks.
arXiv Detail & Related papers (2025-07-09T03:25:45Z) - Come Together, But Not Right Now: A Progressive Strategy to Boost Low-Rank Adaptation [21.137278840000366]
Low-rank adaptation (LoRA) has emerged as a leading parameter-efficient fine-tuning technique for adapting large foundation models. We propose CoTo, a progressive training strategy that gradually increases adapters' activation probability over the course of fine-tuning.
arXiv Detail & Related papers (2025-06-06T03:33:06Z) - PointLoRA: Low-Rank Adaptation with Token Selection for Point Cloud Learning [54.99373314906667]
Self-supervised representation learning for point cloud has demonstrated effectiveness in improving pre-trained model performance across diverse tasks. As pre-trained models grow in complexity, fully fine-tuning them for downstream applications demands substantial computational and storage resources. We propose PointLoRA, a simple yet effective method that combines low-rank adaptation (LoRA) with multi-scale token selection to efficiently fine-tune point cloud models.
arXiv Detail & Related papers (2025-04-22T16:41:21Z) - Towards Optimal Adapter Placement for Efficient Transfer Learning [73.1149084352343]
PETL aims to adapt pre-trained models to new downstream tasks while minimizing the number of fine-tuned parameters.
Adapters, a popular approach in PETL, inject additional capacity into existing networks by incorporating low-rank projections.
This paper investigates the relationship between the placement of an adapter and its performance.
arXiv Detail & Related papers (2024-10-21T10:37:17Z) - Low-Rank Adaptation Secretly Imitates Differentially Private SGD [5.359060261460183]
We show theoretically that low-rank adaptation is equivalent to fine-tuning adapters with noisy batch gradients. We also quantify the variance of the injected noise as a decreasing function of adaptation rank. Low-rank adaptation provides robustness to membership inference attacks with respect to the fine-tuning data.
arXiv Detail & Related papers (2024-09-26T04:56:49Z) - Adaptive Adapter Routing for Long-Tailed Class-Incremental Learning [55.384428765798496]
New data exhibits a long-tailed distribution, such as e-commerce platform reviews.
This necessitates continuous model learning on imbalanced data without forgetting.
We introduce AdaPtive Adapter RouTing (APART) as an exemplar-free solution for LTCIL.
arXiv Detail & Related papers (2024-09-11T17:52:00Z) - Rapid Switching and Multi-Adapter Fusion via Sparse High Rank Adapters [16.160749645651567]
We propose Sparse High Rank Adapters (SHiRA) that directly finetune 1-2% of the base model weights while leaving others unchanged.
This high sparsity incurs no inference overhead, enables rapid switching directly in the fused mode, and significantly reduces concept-loss during multi-adapter fusion.
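The sparse high-rank adapter idea summarized above, fine-tuning only 1-2% of base weights, can be illustrated by masking a weight delta to its largest-magnitude entries. This is a hedged sketch of the general concept, not SHiRA's actual training procedure; the function name and thresholding scheme are assumptions.

```python
import numpy as np

def sparse_delta(W_base, W_ft, keep_frac=0.02):
    """Keep only the largest ~keep_frac fraction of weight changes.

    Illustrative sketch: the resulting sparse delta can be stored as an
    adapter and added back to W_base with no dense low-rank factors.
    """
    delta = W_ft - W_base
    k = max(1, int(keep_frac * delta.size))
    # Threshold at the k-th largest |delta| entry; zero out everything else.
    thresh = np.partition(np.abs(delta).ravel(), -k)[-k]
    mask = np.abs(delta) >= thresh
    return delta * mask  # W_base + sparse_delta approximates W_ft
```

Because the kept entries are not constrained to a low-rank subspace, the stored delta can be high-rank despite being extremely sparse, which is the contrast SHiRA draws with standard LoRA.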
arXiv Detail & Related papers (2024-07-22T22:46:36Z) - SparseAdapter: An Easy Approach for Improving the Parameter-Efficiency of Adapters [96.52807311742198]
We re-examine the parameter-efficiency of Adapters through the lens of network pruning.
We find that SparseAdapter can achieve comparable or better performance than standard Adapters when the sparse ratio reaches up to 80%.
arXiv Detail & Related papers (2022-10-09T15:28:48Z) - AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models [119.7093605087114]
Fine-tuning large-scale pre-trained language models for downstream tasks requires updating hundreds of millions of parameters.
This not only increases the serving cost, since a large copy of the model weights must be stored for every task, but also exhibits instability during few-shot task adaptation.
We introduce a new mechanism, built on two key techniques, to improve adapter capacity without increasing parameters or computational cost.
arXiv Detail & Related papers (2022-05-24T23:41:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.