Back to Blackwell: Closing the Loop on Intransitivity in Multi-Objective Preference Fine-Tuning
- URL: http://arxiv.org/abs/2602.19041v1
- Date: Sun, 22 Feb 2026 04:33:51 GMT
- Title: Back to Blackwell: Closing the Loop on Intransitivity in Multi-Objective Preference Fine-Tuning
- Authors: Jiahao Zhang, Lujing Zhang, Keltin Grimes, Zhuohao Yu, Gokul Swamy, Zhiwei Steven Wu,
- Abstract summary: We propose a novel, game-theoretic solution concept -- the $\textit{Maximum Entropy Blackwell Winner}$ ($\textit{MaxEntBW}$). We then apply $\texttt{PROSPER}$ to the problem of fine-tuning large language models from multi-objective LLM-as-a-Judge feedback. We find that $\texttt{PROSPER}$ outperforms all baselines considered across both instruction following and general chat benchmarks.
- Score: 31.96149633106621
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A recurring challenge in preference fine-tuning (PFT) is handling $\textit{intransitive}$ (i.e., cyclic) preferences. Intransitive preferences often stem from either $\textit{(i)}$ inconsistent rankings along a single objective or $\textit{(ii)}$ scalarizing multiple objectives into a single metric. Regardless of their source, the downstream implication of intransitive preferences is the same: there is no well-defined optimal policy, breaking a core assumption of the standard PFT pipeline. In response, we propose a novel, game-theoretic solution concept -- the $\textit{Maximum Entropy Blackwell Winner}$ ($\textit{MaxEntBW}$) -- that is well-defined under multi-objective intransitive preferences. To enable computing MaxEntBWs at scale, we derive $\texttt{PROSPER}$: a provably efficient PFT algorithm. Unlike prior self-play techniques, $\texttt{PROSPER}$ directly handles multiple objectives without requiring scalarization. We then apply $\texttt{PROSPER}$ to the problem of fine-tuning large language models (LLMs) from multi-objective LLM-as-a-Judge feedback (e.g., rubric-based judges), a setting where both sources of intransitivity arise. We find that $\texttt{PROSPER}$ outperforms all baselines considered across both instruction following and general chat benchmarks, and we release trained model checkpoints at the 7B and 3B parameter scales.
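To make the core difficulty concrete, here is a minimal toy example (ours, not code from the paper) of an intransitive preference among three responses: each one beats another but also loses to another, so no single response is optimal and solution concepts like MaxEntBW must instead consider randomized policies.

```python
import numpy as np

# Toy preference cycle (illustration only): P[i, j] is the probability
# that response i is preferred to response j. A beats B, B beats C, and
# C beats A, so no response wins all of its head-to-head comparisons.
P = np.array([
    [0.5, 0.7, 0.2],   # A vs (A, B, C)
    [0.3, 0.5, 0.8],   # B vs (A, B, C)
    [0.8, 0.2, 0.5],   # C vs (A, B, C)
])

for i, name in enumerate("ABC"):
    beats_all = all(P[i, j] > 0.5 for j in range(3) if j != i)
    print(f"{name} beats every other response: {beats_all}")  # all False
```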
Related papers
- Thompson Sampling for Multi-Objective Linear Contextual Bandit [29.777578580338584]
We study the multi-objective linear contextual bandit problem, where multiple possibly conflicting objectives must be optimized simultaneously. We propose $\texttt{MOL-TS}$, the $\textit{first}$ Thompson Sampling algorithm with Pareto regret guarantees for this problem. Empirical results confirm the benefits of our proposed approach, demonstrating improved regret minimization and strong multi-objective performance.
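As a rough illustration of what a multi-objective Thompson-sampling step can look like (a generic sketch under our own assumptions; $\texttt{MOL-TS}$'s actual posterior updates and arm-selection rule are defined in the paper):

```python
import numpy as np

# Generic two-objective Thompson-sampling step: draw one parameter vector
# per objective from its posterior, score every arm under the draws, and
# restrict play to arms that are Pareto-optimal under the sampled scores.
rng = np.random.default_rng(0)
d = 3
arms = rng.standard_normal((5, d))   # 5 candidate arms with d-dim features
theta = [rng.multivariate_normal(np.zeros(d), np.eye(d)) for _ in range(2)]
scores = np.array([[arm @ t for t in theta] for arm in arms])  # (arm, objective)

def dominated(i):
    return any(np.all(scores[j] >= scores[i]) and np.any(scores[j] > scores[i])
               for j in range(len(arms)) if j != i)

pareto = [i for i in range(len(arms)) if not dominated(i)]
print("Pareto-optimal arms under the sampled parameters:", pareto)
```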
arXiv Detail & Related papers (2025-11-30T15:18:01Z)
- Reinforcement Learning from Adversarial Preferences in Tabular MDPs [62.73758165845971]
We introduce a new framework of episodic Markov decision processes (MDPs) with adversarial preferences (PbMDPs). Unlike standard episodic MDPs with adversarial losses, in PbMDPs the learner instead observes preferences between two candidate arms. We develop algorithms that achieve a regret bound of order $T^{2/3}$ under known transitions.
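A toy sketch of the feedback protocol as we read it (our illustration; the paper's formal setup defines the preference model precisely):

```python
import numpy as np

# Each round the learner proposes two candidate arms and observes only a
# binary preference between them; the comparison rule may change
# adversarially across rounds.
rng = np.random.default_rng(3)

def adversarial_pref(a, b, t):
    # a stand-in adversary that flips its bias every round
    bias = 0.5 + 0.3 * ((-1) ** t)
    return rng.random() < bias   # True means "a preferred over b"

for t in range(4):
    a, b = rng.integers(0, 5, size=2)
    winner = a if adversarial_pref(a, b, t) else b
    print(f"round {t}: compared arms ({a}, {b}), preferred {winner}")
```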
arXiv Detail & Related papers (2025-07-15T20:19:32Z)
- FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA [68.44043212834204]
Low-Rank Adaptation (LoRA) is widely used for efficient fine-tuning of language models in federated learning (FL).
arXiv Detail & Related papers (2025-05-19T07:32:56Z)
- $f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization [54.94545757220999]
$f$-PO is a novel framework that generalizes and extends existing approaches. We conduct experiments on state-of-the-art language models using benchmark datasets.
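For reference, the $f$-divergence family that parameterizes this generalization (the standard definition; how $f$-PO embeds it in the preference-optimization objective is specified in the paper):

```python
import numpy as np

# D_f(P || Q) = sum_i q_i * f(p_i / q_i) for a convex generator f with
# f(1) = 0; different choices of f recover different divergences.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

generators = {
    "KL(P||Q)": lambda t: t * np.log(t),
    "chi^2":    lambda t: (t - 1.0) ** 2,
    "TV":       lambda t: 0.5 * np.abs(t - 1.0),
}
for name, f in generators.items():
    print(name, float(np.sum(q * f(p / q))))
```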
arXiv Detail & Related papers (2024-10-29T02:11:45Z)
- Combinatorial Logistic Bandits [30.829239785016934]
We introduce a novel framework called combinatorial logistic bandits (CLogB). In each round, a subset of base arms (called the super arm) is selected, with the outcome of each base arm being binary. Experiments on real-world datasets demonstrate the superior performance of our algorithms compared to benchmark algorithms.
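One plausible toy reading of this feedback model (our assumption for illustration; the paper defines the exact reward structure):

```python
import numpy as np

# Each base arm i has a feature vector x_i; playing a super arm S reveals
# one Bernoulli outcome per base arm in S, with success probability given
# by a logistic model sigmoid(theta . x_i).
rng = np.random.default_rng(1)
theta = rng.standard_normal(4)
features = rng.standard_normal((10, 4))   # 10 base arms
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

super_arm = [2, 5, 7]                     # chosen subset of base arms
outcomes = {i: bool(rng.random() < sigmoid(theta @ features[i]))
            for i in super_arm}
print(outcomes)
```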
arXiv Detail & Related papers (2024-10-22T14:52:46Z) - Optimal level set estimation for non-parametric tournament and crowdsourcing problems [49.75262185577198]
Motivated by crowdsourcing, we consider a problem where we partially observe the correctness of the answers of $n$ experts on $d$ questions.
In this paper, we assume that the matrix $M$ containing the probability that expert $i$ answers correctly to question $j$ is bi-isotonic up to a permutation of its rows and columns.
We construct a computationally efficient algorithm that turns out to be minimax optimal for this classification problem.
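To make the structural assumption concrete, a small check of bi-isotonicity (ours; the paper only assumes the property holds after some unknown row/column permutation):

```python
import numpy as np

# A matrix is bi-isotonic when its entries are nondecreasing along every
# row and every column; here M is already in sorted order.
M = np.array([[0.2, 0.4, 0.6],
              [0.3, 0.5, 0.7],
              [0.5, 0.8, 0.9]])
rows_ok = bool(np.all(np.diff(M, axis=1) >= 0))
cols_ok = bool(np.all(np.diff(M, axis=0) >= 0))
print(rows_ok and cols_ok)   # True
```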
arXiv Detail & Related papers (2024-08-27T18:28:31Z) - An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding [25.20222970947923]
We propose a method to extend the context length of pre-trained large language models (LLMs).
$\texttt{CREAM}$ interpolates positional encodings by manipulating position indices.
Experiments show that $\texttt{CREAM}$ successfully extends LLMs to the target length for both Base and Chat versions of $\texttt{Llama2-7B}$ with "Never Miss A Beat".
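The generic interpolation idea behind this family of methods (a sketch only; $\texttt{CREAM}$'s exact indexing, which additionally emphasizes the middle of the context, is defined in the paper):

```python
import numpy as np

# Squeeze the position indices of a longer sequence back into the
# pre-trained range [0, L_train) so existing positional encodings
# can be reused without retraining from scratch.
L_train, L_target = 4096, 16384
positions = np.arange(L_target)
interpolated = positions * (L_train / L_target)   # linear interpolation
print(interpolated[:3], interpolated[-3:])        # stays within [0, 4096)
```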
arXiv Detail & Related papers (2024-06-11T10:35:49Z) - Transfer Q Star: Principled Decoding for LLM Alignment [105.89114186982972]
Transfer $Q^*$ estimates the optimal value function for a target reward $r$ through a baseline model.
Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods.
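A generic sketch of value-guided decoding (our illustration of the general idea, not the exact Transfer $Q^*$ estimator):

```python
import numpy as np

# Bias the reference model's next-token logits by an estimate of the
# action value under the target reward, then renormalize.
def guided_next_token_probs(base_logits, q_estimates, alpha=1.0):
    scores = base_logits + alpha * q_estimates   # logit shaping
    scores -= scores.max()                       # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

rng = np.random.default_rng(2)
print(guided_next_token_probs(rng.standard_normal(8),
                              rng.standard_normal(8)).round(3))
```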
arXiv Detail & Related papers (2024-05-30T21:36:12Z)
- A Generalized Scalarization Method for Evolutionary Multi-Objective Optimization [6.902116920364312]
This paper uses the global replacement algorithm (GR) as the backbone.
We find that the $L_p$-based ($1 \leq p < \infty$) subproblems have inconsistently large preference regions.
We propose a generalized $L_p$ (G$L_p$) scalarization to ensure that the subproblem's direction vector passes through its preference region.
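For context, the standard weighted $L_p$ scalarization that this line of work builds on (the baseline form; the G$L_p$ modification of the direction vectors is defined in the paper):

```python
import numpy as np

# Weighted L_p scalarization of objective values f(x) against an ideal
# point z: larger p interpolates from a weighted sum toward a weighted
# Chebyshev (max) scalarization.
def lp_scalarize(f, z, w, p):
    return float(np.sum(w * np.abs(f - z) ** p) ** (1.0 / p))

f = np.array([0.8, 0.3])   # objective values of one candidate solution
z = np.array([0.0, 0.0])   # ideal point
w = np.array([0.5, 0.5])   # subproblem weights
for p in (1, 2, 10):
    print(p, lp_scalarize(f, z, w, p))
```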
arXiv Detail & Related papers (2022-12-03T05:55:04Z)
- Supervised Training of Conditional Monge Maps [107.78770597815242]
Optimal transport (OT) theory describes general principles to define and select, among many possible choices, the most efficient way to map a probability measure onto another.
We introduce CondOT, a multi-task approach to estimate a family of OT maps conditioned on a context variable.
We demonstrate the ability of CondOT to infer the effect of an arbitrary combination of genetic or therapeutic perturbations on single cells.
arXiv Detail & Related papers (2022-06-28T19:34:44Z)
- Towards Painless Policy Optimization for Constrained MDPs [46.12526917024248]
We study policy optimization in an infinite-horizon, $\gamma$-discounted constrained Markov decision process (CMDP).
Our objective is to return a policy that achieves large expected reward with a small constraint violation.
We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms.
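A minimal sketch of what such a primal-dual loop can look like (our toy instance, not the paper's specific algorithm or guarantees):

```python
# Best-respond to the Lagrangian r - lam * c, then raise the dual
# variable lam whenever the cost constraint c <= budget is violated.
def primal_dual(best_response, eval_cost, budget, lr=0.1, iters=200):
    lam, policy = 0.0, None
    for _ in range(iters):
        policy = best_response(lam)                              # primal step
        lam = max(0.0, lam + lr * (eval_cost(policy) - budget))  # dual step
    return policy, lam

# Toy instance: two actions with (reward, cost) pairs and cost budget 0.5.
acts = {"a": (1.0, 1.0), "b": (0.6, 0.2)}
policy, lam = primal_dual(
    best_response=lambda lam: max(acts, key=lambda k: acts[k][0] - lam * acts[k][1]),
    eval_cost=lambda k: acts[k][1],
    budget=0.5,
)
print(policy, round(lam, 2))   # dual variable settles near the switch point
```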
arXiv Detail & Related papers (2022-04-11T15:08:09Z)
- Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$.
We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z)
- Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning [43.75491612671571]
We consider multi-objective reinforcement learning where the objectives are balanced using preferences.
We formalize this problem as an episodic learning problem on a Markov decision process.
We provide a model-based algorithm that achieves a nearly minimax optimal regret bound $\widetilde{\mathcal{O}}\bigl(\sqrt{\min\{d,S\} \cdot H^3 SA/\epsilon^2}\bigr)$.
arXiv Detail & Related papers (2020-11-25T21:45:04Z)