Ranking-aware Reinforcement Learning for Ordinal Ranking
- URL: http://arxiv.org/abs/2601.20585v1
- Date: Wed, 28 Jan 2026 13:22:42 GMT
- Title: Ranking-aware Reinforcement Learning for Ordinal Ranking
- Authors: Aiming Hao, Chen Zhu, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu
- Abstract summary: We propose Ranking-Aware Reinforcement Learning (RARL), a novel RL framework that explicitly learns these ordinal relationships. RARL features a unified objective that integrates regression and Learning-to-Rank (L2R), enabling mutual improvement between the two tasks. To further enhance training, we introduce Response Mutation Operations (RMO), which inject controlled noise to improve exploration and prevent stagnation at saddle points.
- Score: 19.678002354790582
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ordinal regression and ranking are challenging due to inherent ordinal dependencies that conventional methods struggle to model. We propose Ranking-Aware Reinforcement Learning (RARL), a novel RL framework that explicitly learns these relationships. At its core, RARL features a unified objective that synergistically integrates regression and Learning-to-Rank (L2R), enabling mutual improvement between the two tasks. This is driven by a ranking-aware verifiable reward that jointly assesses regression precision and ranking accuracy, facilitating direct model updates via policy optimization. To further enhance training, we introduce Response Mutation Operations (RMO), which inject controlled noise to improve exploration and prevent stagnation at saddle points. The effectiveness of RARL is validated through extensive experiments on three distinct benchmarks.
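The abstract specifies a ranking-aware verifiable reward that jointly assesses regression precision and ranking accuracy, but gives no formula here. A minimal Python sketch of one plausible reading (the trade-off weight `alpha`, the error normalization, and the function name are assumptions, not taken from the paper):

```python
from itertools import combinations

def ranking_aware_reward(preds, labels, alpha=0.5):
    """Illustrative joint reward: regression precision plus pairwise ranking
    accuracy for one group of ordinal predictions. `alpha` is an assumed
    trade-off weight, not a value from the paper."""
    n = len(preds)
    # Regression term: mean closeness, mapped toward [0, 1] via absolute error.
    max_err = max(labels) - min(labels) or 1
    reg = sum(1.0 - abs(p - y) / max_err for p, y in zip(preds, labels)) / n
    # Ranking term: fraction of correctly ordered pairs (pairwise accuracy).
    pairs = [(i, j) for i, j in combinations(range(n), 2) if labels[i] != labels[j]]
    if pairs:
        correct = sum(
            (preds[i] - preds[j]) * (labels[i] - labels[j]) > 0 for i, j in pairs
        )
        rank = correct / len(pairs)
    else:
        rank = 1.0
    return alpha * reg + (1 - alpha) * rank

# Predictions that are close and correctly ordered earn a high joint reward.
print(ranking_aware_reward([1.2, 2.8, 2.1], [1, 3, 2]))
```

Pairwise ordering accuracy is one common verifiable L2R signal; the paper's actual reward may differ.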
Related papers
- Search-R2: Enhancing Search-Integrated Reasoning via Actor-Refiner Collaboration [49.9937230730202]
We propose Search-R2, a novel Actor-Refiner collaboration framework that enhances reasoning through targeted intervention. Our approach decomposes the generation process into an Actor, which produces initial reasoning trajectories, and a Refiner, which applies targeted interventions to refine them. We show that Search-R2 consistently outperforms strong RAG and RL-based baselines across model scales.
arXiv Detail & Related papers (2026-02-03T15:32:09Z)
- From Absolute to Relative: Rethinking Reward Shaping in Group-Based Reinforcement Learning [7.6602542594279335]
We propose Reinforcement Learning with Relative Rewards (RLRR) to shift reward shaping from absolute scoring to relative ranking. We show that RLRR yields consistent performance improvements over standard group-based baselines across reasoning benchmarks and open-ended generation tasks.
arXiv Detail & Related papers (2026-01-30T15:07:06Z)
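RLRR's shift from absolute scoring to relative ranking can be made concrete with a small shaping function; this sketch (the centering scheme and tie handling are assumptions, not details from the paper) replaces raw group scores with zero-centered rank rewards:

```python
def relative_rank_rewards(scores):
    """Map raw scalar scores for a sampled group to rank-based rewards.

    Ties share the average of their rank positions; the output is centered so
    the group mean is zero, as is common for group-based advantage baselines."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2.0  # average zero-based rank for a tie block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    mean = (n - 1) / 2.0
    return [(r - mean) / max(n - 1, 1) for r in ranks]

# Outliers no longer dominate: only the ordering matters.
print(relative_rank_rewards([0.1, 0.2, 9.9]))  # [-0.5, 0.0, 0.5]
```

Rank-based shaping makes the advantage invariant to reward scale, which is the point of moving from absolute to relative signals.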
arXiv Detail & Related papers (2026-01-30T15:07:06Z) - ProRAG: Process-Supervised Reinforcement Learning for Retrieval-Augmented Generation [54.071574153853994]
ProRAG is a process-supervised reinforcement learning framework designed to integrate learned step-level supervision into the online optimization loop. Our framework consists of four stages: (1) Supervised Policy Warmup to initialize the model with a structured reasoning format; (2) construction of an MCTS-based Process Reward Model (PRM) to quantify intermediate reasoning quality; (3) PRM-Guided Reasoning Refinement to align the policy with fine-grained process preferences; and (4) Process-Supervised Reinforcement Learning with a dual-granularity advantage mechanism.
arXiv Detail & Related papers (2026-01-29T16:04:59Z)
- ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking [84.07076200941474]
ArenaRL is a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking. We construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals. Experiments show that ArenaRL substantially outperforms standard RL baselines.
arXiv Detail & Related papers (2026-01-10T08:43:07Z)
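ArenaRL's tournament scheme is only named in the summary; a minimal round-robin sketch of how pairwise outcomes could be turned into zero-centered advantage signals (the `beats` judge and the win-count scoring are assumptions):

```python
from itertools import combinations

def tournament_advantages(responses, beats):
    """Round-robin tournament: each pair is compared once; zero-centered win
    counts become the advantage signal. `beats(a, b)` is an assumed pairwise
    judge returning True if `a` wins (ties go to the second item here)."""
    wins = [0.0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        if beats(responses[i], responses[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    mean = sum(wins) / len(wins)
    return [w - mean for w in wins]  # zero-centered, like a group baseline

# Toy judge: longer responses win. A real system would use a learned comparator.
print(tournament_advantages(["a", "abc", "ab"], lambda x, y: len(x) > len(y)))
```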
- Thinking Forward and Backward: Multi-Objective Reinforcement Learning for Retrieval-Augmented Reasoning [137.33138614095435]
Retrieval-augmented generation (RAG) has proven to be effective in mitigating hallucinations in large language models. Recent efforts have incorporated search-based interactions into RAG, enabling iterative reasoning with real-time retrieval. We propose Bi-RAR, a novel retrieval-augmented reasoning framework that evaluates each intermediate step jointly in both forward and backward directions.
arXiv Detail & Related papers (2025-11-12T08:29:39Z)
- Rethinking Reasoning in Document Ranking: Why Chain-of-Thought Falls Short [36.93384080571354]
Document reranking is a key component in information retrieval (IR). We present the first systematic study of reasoning in reranking across both pointwise and listwise settings.
arXiv Detail & Related papers (2025-10-10T03:59:17Z)
- Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models [50.84995206660551]
We introduce Conditional advANtage estimatiON (CANON) to amplify the impact of a target metric without presuming its direction. When conditioned on entropy, CANON consistently outperforms prior methods on both math reasoning and high-complexity logic tasks.
arXiv Detail & Related papers (2025-09-28T16:33:07Z)
- ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability [83.16850534680505]
We propose an automated reasoning-intensive training data synthesis framework. A self-consistency data filtering mechanism is designed to ensure the data quality. Our trained reasoning-intensive reranker ReasonRank achieves state-of-the-art (SOTA) performance of 40.6 on the BRIGHT leaderboard.
arXiv Detail & Related papers (2025-08-09T17:26:18Z)
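ReasonRank's self-consistency filter is described only at a high level; a hedged sketch of majority-vote filtering for synthesized examples (the sampler interface, sample count, and agreement threshold are all assumptions):

```python
from collections import Counter

def self_consistency_filter(examples, generate, n_samples=8, min_agree=0.75):
    """Keep synthesized training examples whose answer is reproduced by a
    majority of independent samples. `generate(prompt)` is an assumed sampler."""
    kept = []
    for prompt, answer in examples:
        votes = Counter(generate(prompt) for _ in range(n_samples))
        top, count = votes.most_common(1)[0]
        if top == answer and count / n_samples >= min_agree:
            kept.append((prompt, answer))
    return kept

# Toy sampler that always answers "4"; real use would call an LLM.
data = [("2+2=", "4"), ("2+3=", "6")]
print(self_consistency_filter(data, lambda prompt: "4"))  # keeps only ("2+2=", "4")
```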
- R1-Ranker: Teaching LLM Rankers to Reason [35.35360001710222]
R1-Ranker is a reasoning-incentive framework built on reinforcement learning. IRanker decomposes ranking into an iterative elimination process with step-wise rewards to encourage deeper reasoning. We evaluate unified R1-Rankers on nine datasets spanning recommendation, routing, and passage ranking.
arXiv Detail & Related papers (2025-06-25T17:56:06Z)
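IRanker's iterative elimination can be sketched as a loop that repeatedly drops the weakest remaining candidate; the scorer and step-reward hooks below are assumed interfaces, not the paper's API:

```python
def iterative_elimination_rank(candidates, score_worst, step_reward=None):
    """Build a ranking by repeatedly eliminating the weakest remaining item.

    score_worst(pool) -> index of the item judged worst in `pool`.
    The final ranking runs from the survivor (best) back through the
    eliminated items in reverse order (worst eliminated first)."""
    pool = list(candidates)
    eliminated = []
    rewards = []
    while len(pool) > 1:
        worst = score_worst(pool)
        eliminated.append(pool.pop(worst))
        if step_reward is not None:
            rewards.append(step_reward(eliminated[-1], pool))
    ranking = pool + eliminated[::-1]  # survivor first, then reverse order
    return ranking, rewards

# Toy scorer: the shortest passage is "worst" each round.
ranking, _ = iterative_elimination_rank(
    ["bb", "a", "cccc"],
    lambda pool: min(range(len(pool)), key=lambda i: len(pool[i])),
)
print(ranking)  # ['cccc', 'bb', 'a']
```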
- KARE-RAG: Knowledge-Aware Refinement and Enhancement for RAG [63.82127103851471]
Retrieval-Augmented Generation (RAG) enables large language models to access broader knowledge sources. We demonstrate that enhancing generative models' capacity to process noisy content is equally critical for robust performance. We present KARE-RAG, which improves knowledge utilization through three key innovations.
arXiv Detail & Related papers (2025-06-03T06:31:17Z)
- Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach [2.743898388459522]
In deep Reinforcement Learning (RL), the learning rate critically influences both stability and performance, yet its optimal value shifts during training as the environment and policy evolve. Standard decay schedulers assume monotonic convergence and often misalign with these dynamics, leading to premature or delayed adjustments. We introduce LRRL, a meta-learning approach that dynamically selects the learning rate based on policy performance rather than training steps.
arXiv Detail & Related papers (2024-10-16T14:15:28Z)
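One way to realize LRRL's bandit-driven learning-rate selection: an epsilon-greedy bandit over a discrete set of rates, rewarded by recent policy performance (the arm set, epsilon, and reward definition are assumptions, not the paper's exact design):

```python
import random

class BanditLRScheduler:
    """Epsilon-greedy bandit over a discrete set of learning rates.

    After each evaluation window, call `update(reward)` with a performance
    signal (e.g., change in mean episodic return); the next `select()` then
    favors rates that historically improved performance."""

    def __init__(self, rates, epsilon=0.1):
        self.rates = rates
        self.epsilon = epsilon
        self.counts = [0] * len(rates)
        self.values = [0.0] * len(rates)  # running mean reward per arm
        self.current = 0

    def select(self):
        if random.random() < self.epsilon:
            self.current = random.randrange(len(self.rates))  # explore
        else:
            self.current = max(range(len(self.rates)), key=lambda i: self.values[i])
        return self.rates[self.current]

    def update(self, reward):
        i = self.current
        self.counts[i] += 1
        self.values[i] += (reward - self.values[i]) / self.counts[i]

sched = BanditLRScheduler([1e-4, 3e-4, 1e-3])
lr = sched.select()  # use `lr` for the next chunk of policy updates
sched.update(+0.8)   # then report the observed return improvement
```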
- Prior Constraints-based Reward Model Training for Aligning Large Language Models [58.33118716810208]
This paper proposes a Prior Constraints-based Reward Model (PCRM) training method to mitigate uncontrolled scaling of reward scores.
PCRM incorporates prior constraints, specifically, length ratio and cosine similarity between outputs of each comparison pair, during reward model training to regulate optimization magnitude and control score margins.
Experimental results demonstrate that PCRM significantly improves alignment performance by effectively constraining reward score scaling.
arXiv Detail & Related papers (2024-04-01T07:49:11Z)
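The constraints PCRM names, length ratio and cosine similarity of each comparison pair, could enter a standard pairwise reward-model loss as a data-dependent margin; the mixing of the two signals below is an illustrative assumption, not the paper's formula:

```python
import math

def pcrm_margin(len_chosen, len_rejected, cos_sim, base_margin=1.0):
    """Illustrative prior-constraint margin (assumption): pairs that look
    alike (similar length, high cosine similarity) get a smaller target
    margin, so their reward scores are not pushed far apart."""
    length_ratio = min(len_chosen, len_rejected) / max(len_chosen, len_rejected)
    dissimilarity = 0.5 * (1.0 - cos_sim) + 0.5 * (1.0 - length_ratio)
    return base_margin * dissimilarity

def pairwise_rm_loss(r_chosen, r_rejected, margin):
    """Standard Bradley-Terry pairwise loss with a constraint-derived margin."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected - margin))))

# A near-duplicate pair yields a tiny margin, so the optimization pressure
# that separates its reward scores is correspondingly small.
m = pcrm_margin(120, 118, cos_sim=0.97)
print(round(m, 4), round(pairwise_rm_loss(0.3, 0.1, m), 4))
```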
- Fine-grained Correlation Loss for Regression [20.175415393263037]
We revisit classic regression tasks and investigate directly optimizing fine-grained correlation losses. We extensively validate our method on two typical ultrasound image regression tasks: image quality assessment and biometric measurement.
arXiv Detail & Related papers (2022-07-01T11:25:50Z)
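Directly optimizing a correlation loss usually means treating the Pearson statistic itself as the training objective; a plain-Python sketch of the loss (in practice one would write it with an autograd framework so gradients flow through it):

```python
import math

def pearson_correlation_loss(preds, targets, eps=1e-8):
    """1 - Pearson r between predictions and targets. Minimizing it optimizes
    fine-grained linear agreement rather than pointwise error."""
    n = len(preds)
    mp = sum(preds) / n
    mt = sum(targets) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(preds, targets))
    sp = math.sqrt(sum((p - mp) ** 2 for p in preds))
    st = math.sqrt(sum((t - mt) ** 2 for t in targets))
    return 1.0 - cov / (sp * st + eps)

# Perfectly correlated predictions give a loss near zero even though the
# pointwise errors are large, which is the distinctive property of the loss.
print(pearson_correlation_loss([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```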
This list is automatically generated from the titles and abstracts of the papers on this site.