Efficient Online RFT with Plug-and-Play LLM Judges: Unlocking State-of-the-Art Performance
- URL: http://arxiv.org/abs/2506.05748v1
- Date: Fri, 06 Jun 2025 05:18:54 GMT
- Title: Efficient Online RFT with Plug-and-Play LLM Judges: Unlocking State-of-the-Art Performance
- Authors: Rudransh Agnihotri, Ananya Pandey
- Abstract summary: Reward-model training is the cost bottleneck in modern Reinforcement Learning from Human Feedback (RLHF) pipelines. In the proposed method, a frozen, instruction-tuned 7B LLM is augmented with only a one-line JSON rubric and a rank-16 LoRA adapter. The plug-and-play judge achieves 96.2% accuracy on RewardBench, outperforming specialized reward networks ranging from 27B to 70B parameters.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Reward-model training is the cost bottleneck in modern Reinforcement Learning from Human Feedback (RLHF) pipelines, often requiring tens of billions of parameters and an offline preference-tuning phase. In the proposed method, a frozen, instruction-tuned 7B LLM is augmented with only a one-line JSON rubric and a rank-16 LoRA adapter (affecting just 0.8% of the model's parameters), enabling it to serve as a complete substitute for the previously used heavyweight evaluation models. The plug-and-play judge achieves 96.2% accuracy on RewardBench, outperforming specialized reward networks ranging from 27B to 70B parameters. Additionally, it allows a 7B actor trained with online PPO to reach 92% exact-match accuracy on GSM-8K, outperforming the top 70B DPO baseline, which scores 61.8%. Thorough ablations indicate that (i) six in-context demonstrations deliver the majority of the zero-to-few-shot improvement (+2pp), and (ii) the LoRA adapter closes the remaining gap, particularly in the safety and adversarial Chat-Hard segments. The work introduces HH-Rationales, a subset of 10,000 pairs from Anthropic HH-RLHF accompanied by human-generated justifications, to examine interpretability. GPT-4 scoring indicates that our LoRA judge attains approximately 9/10 similarity to human explanations, while zero-shot judges score around 5/10. These results indicate that the combination of prompt engineering and a tiny LoRA adapter produces a cost-effective, transparent, and easily adjustable reward function, removing the offline phase while achieving new state-of-the-art results for both static evaluation and online RLHF.
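The core mechanism in the abstract, a frozen base weight plus a tiny rank-16 LoRA update, can be sketched in a few lines. Everything below (layer sizes, scaling, initialization) is illustrative and not taken from the paper; only the rank-16 / small-trainable-fraction idea comes from the abstract.

```python
import numpy as np

# Toy sketch of a rank-16 LoRA update on one frozen linear layer.
# Shapes and hyperparameters are illustrative, not the paper's.
d_in, d_out, rank = 512, 512, 16
rng = np.random.default_rng(0)

W_frozen = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01          # trainable down-projection
B = np.zeros((d_out, rank))                           # trainable up-projection, zero-init
alpha = 32.0                                          # LoRA scaling hyperparameter

def lora_forward(x):
    # y = W x + (alpha / rank) * B (A x); with B = 0 the adapter is a no-op at init
    return W_frozen @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W_frozen @ x)  # identity at initialization

trainable = A.size + B.size
print(trainable / W_frozen.size)  # 0.0625 for this toy layer
```

In a 7B model the same construction, applied to a subset of layers, is what yields the sub-1% trainable fraction the abstract reports.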
Related papers
- Taming Momentum: Rethinking Optimizer States Through Low-Rank Approximation [85.89510825889168]
We introduce LoRA-Pre, a novel low-rank system for efficient pre-training. LoRA-Pre decomposes the momentum matrix into a compact low-rank subspace within the online linear learner. We empirically validate LoRA-Pre's efficacy by pre-training models from the Llama architecture family.
arXiv Detail & Related papers (2026-02-27T18:57:06Z) - Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models [108.26461635308796]
We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model's reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models. We introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training.
arXiv Detail & Related papers (2026-02-04T15:24:52Z) - Bayesian-LoRA: Probabilistic Low-Rank Adaptation of Large Language Models [5.653755499165773]
We introduce Bayesian-LoRA, which reformulates the deterministic LoRA update as a probabilistic low-rank representation inspired by Sparse Gaussian Processes. With only approximately 0.42M additional parameters and approximately 1.2x the training cost relative to standard LoRA, Bayesian-LoRA significantly improves calibration across models up to 30B.
arXiv Detail & Related papers (2026-01-28T19:54:31Z) - Accuracy and Efficiency Trade-Offs in LLM-Based Malware Detection and Explanation: A Comparative Study of Parameter Tuning vs. Full Fine-Tuning [0.0]
Low-Rank Adaptation (LoRA) fine-tuned Large Language Models (LLMs) can approximate the performance of fully fine-tuned models in generating human-interpretable decisions and explanations for malware classification. LoRA offers a practical balance of interpretability and resource efficiency, enabling deployment in resource-constrained environments without sacrificing explanation quality.
arXiv Detail & Related papers (2025-11-24T19:37:13Z) - SPARK: Synergistic Policy And Reward Co-Evolving Framework [84.22494672256894]
We introduce the Synergistic Policy And Reward Co-Evolving Framework (SPARK), an efficient, on-policy, and stable method that builds on RLVR. Instead of discarding rollouts and correctness data, SPARK recycles this valuable information to simultaneously train the model itself as a generative reward model. We show that SPARK achieves significant performance gains for multiple LLMs and LVLMs across reasoning, reward-model, and general benchmarks.
arXiv Detail & Related papers (2025-09-26T17:50:12Z) - LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model [99.71684530652942]
We show that LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model. Applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation.
arXiv Detail & Related papers (2025-08-31T03:08:02Z) - SingLoRA: Low Rank Adaptation Using a Single Matrix [7.828928639229988]
Low-Rank Adaptation (LoRA) has significantly advanced parameter-efficient fine-tuning of large pretrained models. We propose SingLoRA, which reformulates low-rank adaptation by learning the weight update as the product of a single low-rank matrix and its transpose.
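The single-matrix idea can be illustrated directly: instead of LoRA's ΔW = B A with two trainable matrices, SingLoRA learns one matrix A and uses A Aᵀ. The sketch below is a toy illustration with made-up shapes and scales, not the authors' implementation.

```python
import numpy as np

# SingLoRA-style update: one trainable matrix A, with Delta W = A @ A.T.
d, rank = 256, 8
rng = np.random.default_rng(1)
W = rng.standard_normal((d, d)) * 0.02    # frozen (square) base weight
A = rng.standard_normal((d, rank)) * 0.01  # the single trainable matrix

delta = A @ A.T                            # symmetric update of rank <= 8
assert np.allclose(delta, delta.T)
assert np.linalg.matrix_rank(delta) <= rank

# Half the adapter parameters of standard LoRA at the same rank:
print(A.size, "trainable parameters vs", 2 * A.size, "for B @ A")
```

One property visible in the sketch: the update is symmetric, so in this toy form it only fits square weight matrices.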
arXiv Detail & Related papers (2025-07-08T01:11:30Z) - RoRA: Efficient Fine-Tuning of LLM with Reliability Optimization for Rank Adaptation [59.34193580856381]
Low-Rank Adaptation (LoRA) is widely used and effective for fine-tuning large language models. We propose RoRA (Rank-adaptive Reliability Optimization), a simple yet effective method for optimizing LoRA's scaling factor. RoRA ensures improved performance as rank size increases and excels in the more challenging task of accuracy recovery when fine-tuning pruned models.
arXiv Detail & Related papers (2025-01-08T07:13:52Z) - Dr. SoW: Density Ratio of Strong-over-weak LLMs for Reducing the Cost of Human Annotation in Preference Tuning [15.776175440446414]
We introduce Dr.SoW (Density Ratio of Strong over Weak), a cost-effective method that eliminates the reliance on human annotation. Dr.SoW uses the log-density ratio between a better-aligned and a less-aligned LLM as a reward signal. We preference-tune Llama-3-8B-Instruct using data annotated by Dr.SoW.
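The reward signal here is just a difference of sequence log-probabilities under two frozen models. A minimal sketch, with invented per-token log-probs standing in for real model outputs:

```python
# Sketch of a Dr.SoW-style reward: log-density ratio of a stronger
# (better-aligned) model over a weaker one. The token log-probs below
# are made-up numbers for illustration, not real model outputs.

def density_ratio_reward(logp_strong, logp_weak):
    """r(x, y) = log p_strong(y | x) - log p_weak(y | x), summed over tokens."""
    return sum(logp_strong) - sum(logp_weak)

# A response the stronger model finds more likely gets a positive reward.
strong = [-0.2, -0.5, -0.1]   # per-token log-probs under the strong model
weak = [-1.0, -1.3, -0.9]     # per-token log-probs under the weak model
reward = density_ratio_reward(strong, weak)
assert reward > 0
print(round(reward, 2))  # 2.4
```

Such rewards can then rank sampled responses into preference pairs without any human labels.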
arXiv Detail & Related papers (2024-11-04T18:54:39Z) - Scalable Reinforcement Post-Training Beyond Static Human Prompts: Evolving Alignment via Asymmetric Self-Play [52.3079697845254]
eva is the first method that allows language models to adaptively create training prompts in both offline and online RL post-training. We show eva can create effective RL curricula and is robust across ablations.
arXiv Detail & Related papers (2024-10-31T08:15:32Z) - Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization [64.34767799614328]
Current self-rewarding approaches rely heavily on the discriminator's judgment capabilities.
We propose a novel, only-prompting self-rewarding online algorithm that generates preference datasets without relying on judgment capabilities.
arXiv Detail & Related papers (2024-09-26T04:41:08Z) - RRM: Robust Reward Model Training Mitigates Reward Hacking [51.12341734942797]
Reward models (RMs) play a pivotal role in aligning large language models with human preferences. We introduce a causal framework that learns preferences independent of these artifacts. Experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model.
arXiv Detail & Related papers (2024-09-20T01:46:07Z) - FLoCoRA: Federated learning compression with low-rank adaptation [0.0]
Low-Rank Adaptation (LoRA) methods have gained popularity in efficient parameter fine-tuning of models containing hundreds of billions of parameters.
In this work, we demonstrate the application of LoRA methods to train small-vision models in Federated Learning.
arXiv Detail & Related papers (2024-06-20T07:59:29Z) - MLAE: Masked LoRA Experts for Visual Parameter-Efficient Fine-Tuning [45.93128932828256]
Masked LoRA Experts (MLAE) is an innovative approach that applies the concept of masking to visual PEFT.
Our method incorporates a cellular decomposition strategy that transforms a low-rank matrix into independent rank-1 submatrices.
We show that MLAE achieves new state-of-the-art (SOTA) performance with an average accuracy score of 78.8% on the VTAB-1k benchmark and 90.9% on the FGVC benchmark.
arXiv Detail & Related papers (2024-05-29T08:57:23Z) - Sparse Low-rank Adaptation of Pre-trained Language Models [79.74094517030035]
We introduce sparse low-rank adaptation (SoRA) that enables dynamic adjustments to the intrinsic rank during the adaptation process.
Our approach strengthens the representation power of LoRA by initializing it with a higher rank, while efficiently taming a temporarily increased number of parameters.
Our experimental results demonstrate that SoRA can outperform other baselines even with 70% retained parameters and 70% training time.
arXiv Detail & Related papers (2023-11-20T11:56:25Z) - Exploring the impact of low-rank adaptation on the performance, efficiency, and regularization of RLHF [47.960563851948514]
We investigate an efficient implementation of RLHF using low-rank adaptation (LoRA).
Our implementation achieves better performance than the publicly-released AlpacaFarm checkpoint with full model fine-tuning.
We release our code and pretrained checkpoints to facilitate future research on more efficient RLHF.
arXiv Detail & Related papers (2023-09-16T17:31:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.