APLOT: Robust Reward Modeling via Adaptive Preference Learning with Optimal Transport
- URL: http://arxiv.org/abs/2510.10963v1
- Date: Mon, 13 Oct 2025 03:13:28 GMT
- Title: APLOT: Robust Reward Modeling via Adaptive Preference Learning with Optimal Transport
- Authors: Zhuo Li, Yuege Feng, Dandan Guo, Jinpeng Hu, Anningzhe Gao, Xiang Wan,
- Abstract summary: The reward model (RM) plays a crucial role in aligning Large Language Models (LLMs) with human preferences through Reinforcement Learning.<n>This paper introduces an effective enhancement to BT-based RMs through an adaptive margin mechanism.
- Score: 37.21695864040979
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The reward model (RM) plays a crucial role in aligning Large Language Models (LLMs) with human preferences through Reinforcement Learning, where the Bradley-Terry (BT) objective has been recognized as simple yet powerful, specifically for pairwise preference learning. However, BT-based RMs often struggle to effectively distinguish between similar preference responses, leading to insufficient separation between preferred and non-preferred outputs. Consequently, they may easily overfit easy samples and cannot generalize well to Out-Of-Distribution (OOD) samples, resulting in suboptimal performance. To address these challenges, this paper introduces an effective enhancement to BT-based RMs through an adaptive margin mechanism. Specifically, we design to dynamically adjust the RM focus on more challenging samples through margins, based on both semantic similarity and model-predicted reward differences, which is approached from a distributional perspective solvable with Optimal Transport (OT). By incorporating these factors into a principled OT cost matrix design, our adaptive margin enables the RM to better capture distributional differences between chosen and rejected responses, yielding significant improvements in performance, convergence speed, and generalization capabilities. Experimental results across multiple benchmarks demonstrate that our method outperforms several existing RM techniques, showcasing enhanced performance in both In-Distribution (ID) and OOD settings. Moreover, RLHF experiments support our practical effectiveness in better aligning LLMs with human preferences. Our code is available at https://github.com/BIRlz/APLOT
Related papers
- Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling [58.59644539594293]
DiNa-LRM is a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states.<n>Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty.<n>Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines.
arXiv Detail & Related papers (2026-02-11T18:57:29Z) - Bayesian Preference Learning for Test-Time Steerable Reward Models [9.817455218872645]
We propose Variational In-Context Reward Modeling, which enables test-time steerability via in-context preference demonstrations.<n>We show that I CRM gains 34% accuracy on SafeRLHF and 9% accuracy on RM-Bench in the single-objective setting.<n>We further study the practical applicability of I CRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM in math reasoning.
arXiv Detail & Related papers (2026-02-09T15:55:56Z) - Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling [34.20750590384272]
Process reward models (PRMs) are a cornerstone of test-time scaling (TTS)<n>PRMs are designed to verify and select the best responses from large language models (LLMs)
arXiv Detail & Related papers (2025-10-15T09:08:51Z) - Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization [44.14678335188207]
Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs)<n>Reinforcement learning (RL) is a crucial component for dLLMs to achieve comparable performance with AR-LLMs on important tasks, such as reasoning.<n>This paper proposes Distribution Matching Policy Optimization (DMPO), a principled and theoretically grounded RL fine-tuning method.
arXiv Detail & Related papers (2025-10-09T13:59:50Z) - Multi-Level Aware Preference Learning: Enhancing RLHF for Complex Multi-Instruction Tasks [81.44256822500257]
RLHF has emerged as a predominant approach for aligning artificial intelligence systems with human preferences.<n> RLHF exhibits insufficient compliance capabilities when confronted with complex multi-instruction tasks.<n>We propose a novel Multi-level Aware Preference Learning (MAPL) framework, capable of enhancing multi-instruction capabilities.
arXiv Detail & Related papers (2025-05-19T08:33:11Z) - Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs)<n>RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness.<n>RSD delivers significant efficiency gains against decoding with the target model only, while achieving significant better accuracy than parallel decoding method on average.
arXiv Detail & Related papers (2025-01-31T17:19:57Z) - CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation [20.959453238159863]
Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging.<n>Direct Preference Optimization (DPO) has emerged as a simpler and more efficient alternative, but its performance depends heavily on the quality of preference data.<n>We propose Confidence-Reward driven Preference Optimization (CRPO), a novel method that combines reward scores with model confidence to improve data selection for fine-tuning.
arXiv Detail & Related papers (2025-01-23T18:59:47Z) - Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback [64.67540769692074]
Large language models (LLMs) fine-tuned with alignment techniques, such as reinforcement learning from human feedback, have been instrumental in developing some of the most capable AI systems to date.<n>We introduce an approach called Margin Matching Preference Optimization (MMPO), which incorporates relative quality margins into optimization, leading to improved LLM policies and reward models.<n>Experiments with both human and AI feedback data demonstrate that MMPO consistently outperforms baseline methods, often by a substantial margin, on popular benchmarks including MT-bench and RewardBench.
arXiv Detail & Related papers (2024-10-04T04:56:11Z) - Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z) - Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.<n>We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.<n>We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z) - Prior Constraints-based Reward Model Training for Aligning Large Language Models [58.33118716810208]
This paper proposes a Prior Constraints-based Reward Model (namely PCRM) training method to mitigate this problem.
PCRM incorporates prior constraints, specifically, length ratio and cosine similarity between outputs of each comparison pair, during reward model training to regulate optimization magnitude and control score margins.
Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling.
arXiv Detail & Related papers (2024-04-01T07:49:11Z) - On Diversified Preferences of Large Language Model Alignment [51.26149027399505]
This paper presents the first quantitative analysis of the experimental scaling law for reward models with varying sizes.
Our analysis reveals that the impact of diversified human preferences depends on both model size and data size.
Larger models with sufficient capacity mitigate the negative effects of diverse preferences, while smaller models struggle to accommodate them.
arXiv Detail & Related papers (2023-12-12T16:17:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.