Back to Blackwell: Closing the Loop on Intransitivity in Multi-Objective Preference Fine-Tuning
- URL: http://arxiv.org/abs/2602.19041v1
- Date: Sun, 22 Feb 2026 04:33:51 GMT
- Title: Back to Blackwell: Closing the Loop on Intransitivity in Multi-Objective Preference Fine-Tuning
- Authors: Jiahao Zhang, Lujing Zhang, Keltin Grimes, Zhuohao Yu, Gokul Swamy, Zhiwei Steven Wu,
- Abstract summary: We propose a novel, game-theoretic solution concept -- the $\textit{Maximum Entropy Blackwell Winner}$ ($\textit{MaxEntBW}$). We then apply $\texttt{PROSPER}$ to the problem of fine-tuning large language models from multi-objective LLM-as-a-Judge feedback. We find that $\texttt{PROSPER}$ outperforms all baselines considered across both instruction following and general chat benchmarks.
- Score: 31.96149633106621
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A recurring challenge in preference fine-tuning (PFT) is handling $\textit{intransitive}$ (i.e., cyclic) preferences. Intransitive preferences often stem from either $\textit{(i)}$ inconsistent rankings along a single objective or $\textit{(ii)}$ scalarizing multiple objectives into a single metric. Regardless of their source, the downstream implication of intransitive preferences is the same: there is no well-defined optimal policy, breaking a core assumption of the standard PFT pipeline. In response, we propose a novel, game-theoretic solution concept -- the $\textit{Maximum Entropy Blackwell Winner}$ ($\textit{MaxEntBW}$) -- that is well-defined under multi-objective intransitive preferences. To enable computing MaxEntBWs at scale, we derive $\texttt{PROSPER}$: a provably efficient PFT algorithm. Unlike prior self-play techniques, $\texttt{PROSPER}$ directly handles multiple objectives without requiring scalarization. We then apply $\texttt{PROSPER}$ to the problem of fine-tuning large language models (LLMs) from multi-objective LLM-as-a-Judge feedback (e.g., rubric-based judges), a setting where both sources of intransitivity arise. We find that $\texttt{PROSPER}$ outperforms all baselines considered across both instruction following and general chat benchmarks, and we release trained model checkpoints at the 7B and 3B parameter scales.
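To make the core difficulty concrete, here is a minimal toy example (ours, not code from the paper) of an intransitive preference among three responses: each one beats another but also loses to another, so no single response is optimal and solution concepts like MaxEntBW must instead consider randomized policies.

```python
import numpy as np

# Toy preference cycle (illustration only): P[i, j] is the probability
# that response i is preferred to response j. A beats B, B beats C, and
# C beats A, so no response wins all of its head-to-head comparisons.
P = np.array([
    [0.5, 0.7, 0.2],   # A vs (A, B, C)
    [0.3, 0.5, 0.8],   # B vs (A, B, C)
    [0.8, 0.2, 0.5],   # C vs (A, B, C)
])

for i, name in enumerate("ABC"):
    beats_all = all(P[i, j] > 0.5 for j in range(3) if j != i)
    print(f"{name} beats every other response: {beats_all}")  # all False
```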
Related papers
- Thompson Sampling for Multi-Objective Linear Contextual Bandit [29.777578580338584]
We study the multi-objective linear contextual bandit problem, where multiple possibly conflicting objectives must be optimized simultaneously. We propose $\texttt{MOL-TS}$, the $\textit{first}$ Thompson Sampling algorithm with Pareto regret guarantees for this problem. Empirical results confirm the benefits of our proposed approach, demonstrating improved regret minimization and strong multi-objective performance.
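As a rough illustration of what a multi-objective Thompson-sampling step can look like (a generic sketch under our own assumptions; $\texttt{MOL-TS}$'s actual posterior updates and arm-selection rule are defined in the paper):

```python
import numpy as np

# Generic two-objective Thompson-sampling step: draw one parameter vector
# per objective from its posterior, score every arm under the draws, and
# restrict play to arms that are Pareto-optimal under the sampled scores.
rng = np.random.default_rng(0)
d = 3
arms = rng.standard_normal((5, d))   # 5 candidate arms with d-dim features
theta = [rng.multivariate_normal(np.zeros(d), np.eye(d)) for _ in range(2)]
scores = np.array([[arm @ t for t in theta] for arm in arms])  # (arm, objective)

def dominated(i):
    return any(np.all(scores[j] >= scores[i]) and np.any(scores[j] > scores[i])
               for j in range(len(arms)) if j != i)

pareto = [i for i in range(len(arms)) if not dominated(i)]
print("Pareto-optimal arms under the sampled parameters:", pareto)
```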
arXiv Detail & Related papers (2025-11-30T15:18:01Z)
- Reinforcement Learning from Adversarial Preferences in Tabular MDPs [62.73758165845971]
We introduce a new framework of episodic Markov decision processes (MDPs) with adversarial preferences (PbMDPs). Unlike standard episodic MDPs with adversarial losses, in PbMDPs the learner instead observes preferences between two candidate arms. We develop algorithms that achieve a regret bound of order $T^{2/3}$ under known transitions.
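A toy sketch of the feedback protocol as we read it (our illustration; the paper's formal setup defines the preference model precisely):

```python
import numpy as np

# Each round the learner proposes two candidate arms and observes only a
# binary preference between them; the comparison rule may change
# adversarially across rounds.
rng = np.random.default_rng(3)

def adversarial_pref(a, b, t):
    # a stand-in adversary that flips its bias every round
    bias = 0.5 + 0.3 * ((-1) ** t)
    return rng.random() < bias   # True means "a preferred over b"

for t in range(4):
    a, b = rng.integers(0, 5, size=2)
    winner = a if adversarial_pref(a, b, t) else b
    print(f"round {t}: compared arms ({a}, {b}), preferred {winner}")
```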
arXiv Detail & Related papers (2025-07-15T20:19:32Z)
- FedSVD: Adaptive Orthogonalization for Private Federated Learning with LoRA [68.44043212834204]
Low-Rank Adaptation (LoRA) is widely used for efficient fine-tuning of language models in federated learning (FL).
arXiv Detail & Related papers (2025-05-19T07:32:56Z)
- $f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization [54.94545757220999]
$f$-PO is a novel framework that generalizes and extends existing approaches. We conduct experiments on state-of-the-art language models using benchmark datasets.
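For reference, the $f$-divergence family that parameterizes this generalization (the standard definition; how $f$-PO embeds it in the preference-optimization objective is specified in the paper):

```python
import numpy as np

# D_f(P || Q) = sum_i q_i * f(p_i / q_i) for a convex generator f with
# f(1) = 0; different choices of f recover different divergences.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

generators = {
    "KL(P||Q)": lambda t: t * np.log(t),
    "chi^2":    lambda t: (t - 1.0) ** 2,
    "TV":       lambda t: 0.5 * np.abs(t - 1.0),
}
for name, f in generators.items():
    print(name, float(np.sum(q * f(p / q))))
```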
arXiv Detail & Related papers (2024-10-29T02:11:45Z)
- Combinatorial Logistic Bandits [30.829239785016934]
We introduce a novel framework called combinatorial logistic bandits (CLogB). In each round, a subset of base arms (called the super arm) is selected, with the outcome of each base arm being binary. Experiments on real-world datasets demonstrate the superior performance of our algorithms compared to benchmark algorithms.
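One plausible toy reading of this feedback model (our assumption for illustration; the paper defines the exact reward structure):

```python
import numpy as np

# Each base arm i has a feature vector x_i; playing a super arm S reveals
# one Bernoulli outcome per base arm in S, with success probability given
# by a logistic model sigmoid(theta . x_i).
rng = np.random.default_rng(1)
theta = rng.standard_normal(4)
features = rng.standard_normal((10, 4))   # 10 base arms
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

super_arm = [2, 5, 7]                     # chosen subset of base arms
outcomes = {i: bool(rng.random() < sigmoid(theta @ features[i]))
            for i in super_arm}
print(outcomes)
```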
arXiv Detail & Related papers (2024-10-22T14:52:46Z) - Optimal level set estimation for non-parametric tournament and crowdsourcing problems [49.75262185577198]
Motivated by crowdsourcing, we consider a problem where we partially observe the correctness of the answers of $n$ experts on $d$ questions.
In this paper, we assume that the matrix $M$ containing the probability that expert $i$ answers correctly to question $j$ is bi-isotonic up to a permutation of its rows and columns.
We construct a computationally efficient algorithm that turns out to be minimax optimal for this classification problem.
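To make the structural assumption concrete, a small check of bi-isotonicity (ours; the paper only assumes the property holds after some unknown row/column permutation):

```python
import numpy as np

# A matrix is bi-isotonic when its entries are nondecreasing along every
# row and every column; here M is already in sorted order.
M = np.array([[0.2, 0.4, 0.6],
              [0.3, 0.5, 0.7],
              [0.5, 0.8, 0.9]])
rows_ok = bool(np.all(np.diff(M, axis=1) >= 0))
cols_ok = bool(np.all(np.diff(M, axis=0) >= 0))
print(rows_ok and cols_ok)   # True
```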
arXiv Detail & Related papers (2024-08-27T18:28:31Z) - An Efficient Recipe for Long Context Extension via Middle-Focused Positional Encoding [25.20222970947923]
We propose a method to extend the context length of pre-trained large language models (LLMs).
$\texttt{CREAM}$ interpolates positional encodings by manipulating position indices.
Experiments show that $\texttt{CREAM}$ successfully extends LLMs to the target length for both Base and Chat versions of $\texttt{Llama2-7B}$ with "Never Miss A Beat".
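The generic interpolation idea behind this family of methods (a sketch only; $\texttt{CREAM}$'s exact indexing, which additionally emphasizes the middle of the context, is defined in the paper):

```python
import numpy as np

# Squeeze the position indices of a longer sequence back into the
# pre-trained range [0, L_train) so existing positional encodings
# can be reused without retraining from scratch.
L_train, L_target = 4096, 16384
positions = np.arange(L_target)
interpolated = positions * (L_train / L_target)   # linear interpolation
print(interpolated[:3], interpolated[-3:])        # stays within [0, 4096)
```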
arXiv Detail & Related papers (2024-06-11T10:35:49Z) - Transfer Q Star: Principled Decoding for LLM Alignment [105.89114186982972]
Transfer $Q^*$ estimates the optimal value function for a target reward $r$ through a baseline model.
Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods.
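A generic sketch of value-guided decoding (our illustration of the general idea, not the exact Transfer $Q^*$ estimator):

```python
import numpy as np

# Bias the reference model's next-token logits by an estimate of the
# action value under the target reward, then renormalize.
def guided_next_token_probs(base_logits, q_estimates, alpha=1.0):
    scores = base_logits + alpha * q_estimates   # logit shaping
    scores -= scores.max()                       # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

rng = np.random.default_rng(2)
print(guided_next_token_probs(rng.standard_normal(8),
                              rng.standard_normal(8)).round(3))
```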
arXiv Detail & Related papers (2024-05-30T21:36:12Z)
- A Generalized Scalarization Method for Evolutionary Multi-Objective Optimization [6.902116920364312]
This paper uses the global replacement algorithm (GR) as the backbone.
We find that the $L_p$-based ($1 \leq p < \infty$) subproblems have inconsistently large preference regions.
We propose a generalized $L_p$ (G$L_p$) scalarization to ensure that the subproblem's direction vector passes through its preference region.
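For context, the standard weighted $L_p$ scalarization that this line of work builds on (the baseline form; the G$L_p$ modification of the direction vectors is defined in the paper):

```python
import numpy as np

# Weighted L_p scalarization of objective values f(x) against an ideal
# point z: larger p interpolates from a weighted sum toward a weighted
# Chebyshev (max) scalarization.
def lp_scalarize(f, z, w, p):
    return float(np.sum(w * np.abs(f - z) ** p) ** (1.0 / p))

f = np.array([0.8, 0.3])   # objective values of one candidate solution
z = np.array([0.0, 0.0])   # ideal point
w = np.array([0.5, 0.5])   # subproblem weights
for p in (1, 2, 10):
    print(p, lp_scalarize(f, z, w, p))
```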
arXiv Detail & Related papers (2022-12-03T05:55:04Z)
- Supervised Training of Conditional Monge Maps [107.78770597815242]
Optimal transport (OT) theory describes general principles to define and select, among many possible choices, the most efficient way to map a probability measure onto another.
We introduce CondOT, a multi-task approach to estimate a family of OT maps conditioned on a context variable.
We demonstrate the ability of CondOT to infer the effect of an arbitrary combination of genetic or therapeutic perturbations on single cells.
arXiv Detail & Related papers (2022-06-28T19:34:44Z)
- Towards Painless Policy Optimization for Constrained MDPs [46.12526917024248]
We study policy optimization in an infinite-horizon, $\gamma$-discounted constrained Markov decision process (CMDP).
Our objective is to return a policy that achieves large expected reward with a small constraint violation.
We propose a generic primal-dual framework that allows us to bound the reward sub-optimality and constraint violation for arbitrary algorithms.
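A minimal sketch of what such a primal-dual loop can look like (our toy instance, not the paper's specific algorithm or guarantees):

```python
# Best-respond to the Lagrangian r - lam * c, then raise the dual
# variable lam whenever the cost constraint c <= budget is violated.
def primal_dual(best_response, eval_cost, budget, lr=0.1, iters=200):
    lam, policy = 0.0, None
    for _ in range(iters):
        policy = best_response(lam)                              # primal step
        lam = max(0.0, lam + lr * (eval_cost(policy) - budget))  # dual step
    return policy, lam

# Toy instance: two actions with (reward, cost) pairs and cost budget 0.5.
acts = {"a": (1.0, 1.0), "b": (0.6, 0.2)}
policy, lam = primal_dual(
    best_response=lambda lam: max(acts, key=lambda k: acts[k][0] - lam * acts[k][1]),
    eval_cost=lambda k: acts[k][1],
    budget=0.5,
)
print(policy, round(lam, 2))   # dual variable settles near the switch point
```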
arXiv Detail & Related papers (2022-04-11T15:08:09Z)
- Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$.
We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z)
- Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning [43.75491612671571]
We consider multi-objective reinforcement learning where the objectives are balanced using preferences.
We formalize this problem as an episodic learning problem on a Markov decision process.
We provide a model-based algorithm that achieves a nearly minimax optimal regret bound $\widetilde{\mathcal{O}}\bigl(\sqrt{\min\{d,S\} \cdot H^3 SA/\epsilon^2}\bigr)$.
arXiv Detail & Related papers (2020-11-25T21:45:04Z)