RewardRank: Optimizing True Learning-to-Rank Utility
- URL: http://arxiv.org/abs/2508.14180v2
- Date: Fri, 17 Oct 2025 21:32:58 GMT
- Title: RewardRank: Optimizing True Learning-to-Rank Utility
- Authors: Gaurav Bhatt, Kiran Koshy Thekumparampil, Tanmay Gangwani, Tesi Xiao, Leonid Sigal
- Abstract summary: We introduce RewardRank, a data-driven learning-to-rank framework for counterfactual utility maximization. Our results show that learning-to-rank can be reformulated as direct optimization of counterfactual utility.
- Score: 28.662272762911325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional ranking systems optimize offline proxy objectives that rely on oversimplified assumptions about user behavior, often neglecting factors such as position bias and item diversity. Consequently, these models fail to improve true counterfactual utilities, such as click-through rate or purchase probability, when evaluated in online A/B tests. We introduce RewardRank, a data-driven learning-to-rank (LTR) framework for counterfactual utility maximization. RewardRank first learns a reward model that predicts the utility of any ranking directly from logged user interactions, and then trains a ranker to maximize this reward using a differentiable soft permutation operator. To enable rigorous and reproducible evaluation, we further propose two benchmark suites: (i) Parametric Oracle Evaluation (PO-Eval), which employs an open-source click model as a counterfactual oracle on the Baidu-ULTR dataset, and (ii) LLM-as-User Evaluation (LAU-Eval), which simulates realistic user behavior via large language models on the Amazon-KDD-Cup dataset. RewardRank achieves the highest counterfactual utility across both benchmarks and demonstrates that optimizing classical metrics such as NDCG is sub-optimal for maximizing true user utility. Finally, using real user feedback from the Baidu-ULTR dataset, RewardRank establishes a new state of the art in offline relevance performance. Overall, our results show that learning-to-rank can be reformulated as direct optimization of counterfactual utility, achieved in a purely data-driven manner without relying on explicit modeling assumptions such as position bias. Our code is available at: https://github.com/GauravBh1010tt/RewardRank
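The abstract names the two components (a learned reward model and a differentiable soft permutation operator) but not the specific operator. Below is a minimal PyTorch sketch of this pipeline, assuming a NeuralSort-style relaxation; `ranker` and `reward_model` are hypothetical modules (a per-item scorer and a slate-utility predictor), not the authors' actual implementation.

```python
import torch

def neuralsort(scores, tau=1.0):
    """NeuralSort relaxation: row-stochastic soft permutation that sorts scores descending."""
    n = scores.size(-1)
    A = (scores.unsqueeze(-1) - scores.unsqueeze(-2)).abs()   # (B, n, n) pairwise |s_i - s_j|
    B = A.sum(dim=-1)                                         # (B, n) row sums
    pos = torch.arange(1, n + 1, dtype=scores.dtype, device=scores.device)
    scaling = n + 1 - 2 * pos                                 # per-rank scaling from NeuralSort
    logits = scaling.view(1, -1, 1) * scores.unsqueeze(1) - B.unsqueeze(1)
    return torch.softmax(logits / tau, dim=-1)                # (B, n, n)

def ranker_step(ranker, reward_model, X, opt, tau=1.0):
    """One ranker update against a frozen reward model. X: (B, n, d) slate of item features."""
    scores = ranker(X).squeeze(-1)            # (B, n) one score per candidate
    P = neuralsort(scores, tau)               # differentiable soft permutation
    soft_slate = P @ X                        # softly re-ordered item features
    loss = -reward_model(soft_slate).mean()   # ascend on predicted counterfactual utility
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

As the temperature `tau` is annealed toward zero, the soft permutation approaches a hard ranking, so at inference time one can simply sort items by the ranker's scores.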
Related papers
- PROF: An LLM-based Reward Code Preference Optimization Framework for Offline Imitation Learning [29.373324685358753]
We propose PROF, a framework to generate and improve executable reward function codes from natural language descriptions and a single expert trajectory. We also propose Reward Preference Ranking (RPR), a novel reward function quality assessment and ranking strategy without requiring environment interactions or RL training.
arXiv Detail & Related papers (2025-11-14T14:38:02Z)
- LLMs for estimating positional bias in logged interaction data [44.839172857330674]
We propose a novel method for estimating position bias using Large Language Models (LLMs). Our experiments show that propensities estimated with our LLM-as-a-judge approach are stable across score buckets. An IPS-weighted reranker trained with these propensities matches the production model on standard NDCG@10 while improving weighted NDCG@10 by roughly 2%.
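The entry does not state the reranker's objective; a standard choice for an IPS-weighted reranker is a propensity-corrected pointwise click loss. A minimal sketch under that assumption (names are illustrative; in this paper the propensities would come from the LLM-as-a-judge estimator):

```python
import torch
import torch.nn.functional as F

def ips_pointwise_loss(scores, clicks, propensities, clip=0.01):
    """Propensity-corrected click loss: clicks are up-weighted by 1/p(examined at logged position).

    scores: (N,) model scores for logged items
    clicks: (N,) observed 0/1 clicks
    propensities: (N,) estimated examination probabilities
    """
    w = clicks / propensities.clamp(min=clip)   # inverse-propensity weights; clipping bounds variance
    return -(w * F.logsigmoid(scores)).mean()
```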
arXiv Detail & Related papers (2025-09-03T20:26:06Z)
- RankList -- A Listwise Preference Learning Framework for Predicting Subjective Preferences [66.76322360727809]
We propose RankList, a listwise preference learning framework that generalizes RankNet to structured list-level supervision. Our formulation explicitly models local and non-local ranking constraints within a probabilistic framework. Extensive experiments demonstrate the superiority of our method across diverse modalities.
arXiv Detail & Related papers (2025-08-13T13:59:41Z)
- Personalized Recommendations via Active Utility-based Pairwise Sampling [1.704905100460915]
We propose a utility-based framework that learns preferences from simple and intuitive pairwise comparisons. A central contribution of our work is a novel utility-based active sampling strategy for preference elicitation.
arXiv Detail & Related papers (2025-08-12T19:09:33Z)
- From Generation to Consumption: Personalized List Value Estimation for Re-ranking [11.827600847399973]
We propose CAVE, a personalized Consumption-Aware list Value Estimation framework. CAVE formulates the list value as the expectation over sub-list values, weighted by user-specific exit probabilities at each position. By jointly modeling sub-list values and user exit behavior, CAVE yields a more faithful estimate of actual list consumption value.
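The stated decomposition is an expectation of sub-list values under a user-specific exit distribution. A tiny sketch of that weighted sum (function and argument names are illustrative, not CAVE's actual API):

```python
def expected_list_value(sublist_values, exit_probs):
    """E[V(list)] = sum_k P(user exits at position k) * V(items 1..k).

    sublist_values[k]: estimated value of the list truncated after position k+1
    exit_probs[k]: personalized probability that the user exits at that position
    """
    assert abs(sum(exit_probs) - 1.0) < 1e-6, "exit probabilities should form a distribution"
    return sum(p * v for p, v in zip(exit_probs, sublist_values))
```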
arXiv Detail & Related papers (2025-08-04T09:43:21Z)
- Rank-R1: Enhancing Reasoning in LLM-based Document Rerankers via Reinforcement Learning [76.50690734636477]
We introduce Rank-R1, a novel LLM-based reranker that performs reasoning over both the user query and candidate documents before performing the ranking task. Our experiments on the TREC DL and BRIGHT datasets show that Rank-R1 is highly effective, especially for complex queries.
arXiv Detail & Related papers (2025-03-08T03:14:26Z)
- Unbiased Learning to Rank with Query-Level Click Propensity Estimation: Beyond Pointwise Observation and Relevance [74.43264459255121]
In real-world scenarios, users often click only one or two results after examining multiple relevant options. We propose a query-level click propensity model to capture the probability that users will click on different result lists. Our method introduces a Dual Inverse Propensity Weighting mechanism to address both relevance saturation and position bias.
arXiv Detail & Related papers (2025-02-17T03:55:51Z)
- CREAM: Consistency Regularized Self-Rewarding Language Models [34.325289477993586]
Self-rewarding large language models (LLMs) have successfully applied LLM-as-a-Judge to improve alignment performance without the need for human-annotated preference data. However, there is no guarantee that the rewarding and ranking are accurate, yet accuracy is critical for obtaining reliable rewards and high-quality preference data. We propose a Consistency Regularized sElf-rewarding lAnguage Model (CREAM) that leverages the consistency of rewards across different iterations to regularize the self-rewarding training.
arXiv Detail & Related papers (2024-10-16T16:51:01Z)
- Adaptively Learning to Select-Rank in Online Platforms [34.258659206323664]
This research addresses the challenge of adaptively ranking items from a candidate pool for heterogeneous users.
We develop a user response model that considers diverse user preferences and the varying effects of item positions.
Experiments conducted on both simulated and real-world datasets demonstrate our algorithm outperforms the baseline.
arXiv Detail & Related papers (2024-06-07T15:33:48Z)
- Learning Fair Ranking Policies via Differentiable Optimization of Ordered Weighted Averages [55.04219793298687]
This paper shows how efficiently solvable fair ranking models can be integrated into the training loop of Learning to Rank.
In particular, this paper is the first to show how to backpropagate through constrained optimizations of OWA objectives, enabling their use in integrated prediction and decision models.
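For reference, the OWA aggregation itself is a weighted sum over the sorted utility vector; fairness-oriented weightings place the largest weights on the worst-off entries. The sketch below shows only this aggregation; the paper's actual contribution, backpropagating through constrained optimization of OWA objectives, is not reproduced here:

```python
import torch

def fair_owa(values, weights):
    """Fair OWA (generalized Gini): non-increasing weights applied to values sorted
    ascending, so the worst-off entry receives the largest weight. Gradients flow
    through torch.sort via the (piecewise-constant) sorting permutation."""
    sorted_vals, _ = torch.sort(values, dim=-1)   # ascending: worst-off first
    return (weights * sorted_vals).sum(-1)
```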
arXiv Detail & Related papers (2024-02-07T20:53:53Z)
- Sample Complexity of Preference-Based Nonparametric Off-Policy Evaluation with Deep Networks [58.469818546042696]
We study the sample efficiency of OPE with human preference and establish a statistical guarantee for it.
By appropriately selecting the size of a ReLU network, we show that one can leverage any low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2023-10-16T16:27:06Z)
- Replace Scoring with Arrangement: A Contextual Set-to-Arrangement Framework for Learning-to-Rank [40.81502990315285]
Learning-to-rank is a core technique in the top-N recommendation task, where an ideal ranker would be a mapping from an item set to an arrangement.
Most existing solutions fall under the paradigm of the probabilistic ranking principle (PRP): first score each item in the candidate set, then sort to generate the top ranking list.
We propose Set-To-Arrangement Ranking (STARank), a new framework that directly generates permutations of the candidate items without the need for individual scoring and sorting operations.
arXiv Detail & Related papers (2023-08-05T12:22:26Z)
- Attention Weighted Mixture of Experts with Contrastive Learning for Personalized Ranking in E-commerce [21.7796124109]
We propose Attention Weighted Mixture of Experts (AW-MoE) with contrastive learning for personalized ranking.
AW-MoE has been successfully deployed in the JD e-commerce search engine.
arXiv Detail & Related papers (2023-06-08T07:59:08Z)
- Boosting the Learning for Ranking Patterns [6.142272540492935]
This paper formulates the problem of learning pattern ranking functions as a multi-criteria decision making problem.
Our approach aggregates different interestingness measures into a single weighted linear ranking function, using an interactive learning procedure.
Experiments conducted on well-known datasets show that our approach significantly reduces the running time and returns precise pattern rankings.
arXiv Detail & Related papers (2022-03-05T10:22:44Z)
- PiRank: Learning To Rank via Differentiable Sorting [85.28916333414145]
We propose PiRank, a new class of differentiable surrogates for ranking.
We show that PiRank exactly recovers the desired metrics in the limit of zero temperature.
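The zero-temperature claim can be illustrated with any temperature-controlled soft-sorting relaxation. A toy SoftSort-style example follows (illustrative; not PiRank's exact operator):

```python
import torch

def softsort(s, tau):
    """SoftSort-style relaxation: row i is a softmax over -|sorted(s)_i - s_j| / tau."""
    s_sorted = s.sort(descending=True).values
    logits = -(s_sorted.unsqueeze(-1) - s.unsqueeze(-2)).abs() / tau
    return torch.softmax(logits, dim=-1)

s = torch.tensor([0.3, 1.2, -0.5])
for tau in (1.0, 0.1, 0.01):
    print(tau, softsort(s, tau))
# As tau -> 0 each row approaches a one-hot vector, so the matrix converges to the
# exact sorting permutation; this is the sense in which such surrogates recover the
# target ranking metric in the zero-temperature limit.
```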
arXiv Detail & Related papers (2020-12-12T05:07:36Z)
- Taking the Counterfactual Online: Efficient and Unbiased Online Evaluation for Ranking [74.46448041224247]
We introduce the novel Logging-Policy Optimization Algorithm (LogOpt), which optimizes the policy for logging data.
LogOpt turns the counterfactual approach - which is indifferent to the logging policy - into an online approach, where the algorithm decides what rankings to display.
We prove that, as an online evaluation method, LogOpt is unbiased w.r.t. position and item-selection bias, unlike existing interleaving methods.
arXiv Detail & Related papers (2020-07-24T18:05:58Z)
- SetRank: A Setwise Bayesian Approach for Collaborative Ranking from Implicit Feedback [50.13745601531148]
We propose a novel setwise Bayesian approach for collaborative ranking, namely SetRank, to accommodate the characteristics of implicit feedback in recommender systems.
Specifically, SetRank aims at maximizing the posterior probability of novel setwise preference comparisons.
We also present a theoretical analysis of SetRank showing that the excess risk bound can be proportional to $\sqrt{M/N}$.
arXiv Detail & Related papers (2020-02-23T06:40:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.