Related papers: Evaluating Parameter Efficient Methods for RLVR

Evaluating Parameter Efficient Methods for RLVR

URL: http://arxiv.org/abs/2512.23165v2
Date: Tue, 30 Dec 2025 13:11:05 GMT
Title: Evaluating Parameter Efficient Methods for RLVR
Authors: Qingyu Yin, Yulun Wu, Zhennan Shen, Sunbowen Li, Zhilin Wang, Yanshu Li, Chak Tou Leong, Jiale Kang, Jinjin Gu,
Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) incentivizes language models to enhance their reasoning capabilities through verifiable feedback.<n>While methods like LoRA are commonly used, the optimal PEFT architecture for RLVR remains unidentified.<n>We conduct the first comprehensive evaluation of over 12 PEFT methodologies across the DeepSeek-R1-Distill families on mathematical reasoning benchmarks.
Score: 38.45552186628944
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We systematically evaluate Parameter-Efficient Fine-Tuning (PEFT) methods under the paradigm of Reinforcement Learning with Verifiable Rewards (RLVR). RLVR incentivizes language models to enhance their reasoning capabilities through verifiable feedback; however, while methods like LoRA are commonly used, the optimal PEFT architecture for RLVR remains unidentified. In this work, we conduct the first comprehensive evaluation of over 12 PEFT methodologies across the DeepSeek-R1-Distill families on mathematical reasoning benchmarks. Our empirical results challenge the default adoption of standard LoRA with three main findings. First, we demonstrate that structural variants, such as DoRA, AdaLoRA, and MiSS, consistently outperform LoRA. Second, we uncover a spectral collapse phenomenon in SVD-informed initialization strategies (\textit{e.g.,} PiSSA, MiLoRA), attributing their failure to a fundamental misalignment between principal-component updates and RL optimization. Furthermore, our ablations reveal that extreme parameter reduction (\textit{e.g.,} VeRA, Rank-1) severely bottlenecks reasoning capacity. We further conduct ablation studies and scaling experiments to validate our findings. This work provides a definitive guide for advocating for more exploration for parameter-efficient RL methods.

Related papers

Learning Rate Matters: Vanilla LoRA May Suffice for LLM Fine-tuning [48.66442009036754]
Low-Rank Adaptation (LoRA) is the prevailing approach for efficient large language model fine-tuning.<n>In this work, we re-evaluate four representative LoRA variants alongside vanilla LoRA.<n>We find that different LoRA methods favor distinct learning rate ranges.
arXiv Detail & Related papers (2026-02-04T19:36:20Z)
A Unified Study of LoRA Variants: Taxonomy, Review, Codebase, and Empirical Evaluation [22.672020176368083]
Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method that balances efficiency and performance in large-scale neural networks.<n>This work presents the first unified study of LoRA variants, offering a systematic taxonomy, unified theoretical review, structured, and standardized empirical assessment.
arXiv Detail & Related papers (2026-01-30T08:30:05Z)
The Path Not Taken: RLVR Provably Learns Off the Principals [85.41043469428365]
We show that sparsity is a surface artifact of a model-conditioned optimization bias.<n>We mechanistically explain these dynamics with a Three-Gate Theory.<n>We provide a parameter-level characterization of RLVR's learning dynamics.
arXiv Detail & Related papers (2025-11-11T18:49:45Z)
CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models [85.315711639214]
We introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model's own intrinsic sense of curiosity to guide exploration.<n>For the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture.<n>Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses.
arXiv Detail & Related papers (2025-09-11T17:59:17Z)
Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS [62.22644307952087]
We introduce AIRL-S, the first natural unification of RL-based and search-based TTS.<n>We leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces.<n>Our results show that our unified approach improves performance by 9 % on average over the base model, matching GPT-4o.
arXiv Detail & Related papers (2025-08-19T23:41:15Z)
Effective Inference-Free Retrieval for Learned Sparse Representations [19.54810957623511]
Learned Sparse Retrieval (LSR) is an effective IR approach that exploits pre-trained language models for encoding text into a learned bag of words.<n>Recently, new efficient -- inverted index-based -- retrieval engines have been proposed, leading to a natural question: has the role of regularization changed in training LSR models?<n>We show that regularization can be relaxed to produce more effective LSR encoders.
arXiv Detail & Related papers (2025-04-30T09:10:46Z)
RepLoRA: Reparameterizing Low-Rank Adaptation via the Perspective of Mixture of Experts [37.43961020113692]
Low-rank Adaptation (LoRA) has emerged as a powerful method for fine-tuning large-scale foundation models.<n>This paper presents a theoretical analysis of LoRA by examining its connection to the Mixture of Experts models.
arXiv Detail & Related papers (2025-02-05T10:03:09Z)
Robust Federated Finetuning of LLMs via Alternating Optimization of LoRA [10.756801183126525]
We propose RoLoRA, a federated framework using alternating optimization to fine-tune LoRA adapters.<n>We use both theoretical analysis and extensive experiments to demonstrate the advantages of RoLoRA over prior approaches.
arXiv Detail & Related papers (2025-02-03T19:02:00Z)
SD-LoRA: Scalable Decoupled Low-Rank Adaptation for Class Incremental Learning [73.93639228235622]
Continual Learning with foundation models has emerged as a promising paradigm to exploit abundant knowledge acquired during pre-training for tackling sequential tasks.<n>Existing prompt-based and Low-Rank Adaptation-based (LoRA-based) methods often require expanding a prompt/LoRA pool or retaining samples of previous tasks.<n>We propose Scalable Decoupled LoRA (SD-LoRA) for class incremental learning, which continually separates the learning of the magnitude and direction of LoRA components without rehearsal.
arXiv Detail & Related papers (2025-01-22T20:00:41Z)
Task-Specific Directions: Definition, Exploration, and Utilization in Parameter Efficient Fine-Tuning [65.31677646659895]
Large language models demonstrate impressive performance on downstream tasks, yet they require extensive resource consumption when fully fine-tuning all parameters.<n>We propose a framework to clearly define task-specific directions (TSDs) and explore their properties and practical utilization challenges.<n>We then introduce a novel approach, LoRA-Dash, which aims to maximize the impact of TSDs during the fine-tuning process.
arXiv Detail & Related papers (2024-09-02T08:10:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.