Asymptotic Universal Alignment: A New Alignment Framework via Test-Time Scaling
- URL: http://arxiv.org/abs/2601.08777v1
- Date: Tue, 13 Jan 2026 18:08:06 GMT
- Title: Asymptotic Universal Alignment: A New Alignment Framework via Test-Time Scaling
- Authors: Yang Cai, Weiqiang Zheng
- Abstract summary: We formalize an ideal notion of universal alignment through test-time scaling. We show that popular post-training methods, including Nash learning from human feedback, can underutilize the benefits of test-time scaling.
- Score: 14.939480324139288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Aligning large language models (LLMs) to serve users with heterogeneous and potentially conflicting preferences is a central challenge for personalized and trustworthy AI. We formalize an ideal notion of universal alignment through test-time scaling: for each prompt, the model produces $k\ge 1$ candidate responses and a user selects their preferred one. We introduce $(k,f(k))$-robust alignment, which requires the $k$-output model to have win rate $f(k)$ against any other single-output model, and asymptotic universal alignment (U-alignment), which requires $f(k)\to 1$ as $k\to\infty$. Our main result characterizes the optimal convergence rate: there exists a family of single-output policies whose $k$-sample product policies achieve U-alignment at rate $f(k)=\frac{k}{k+1}$, and no method can achieve a faster rate in general. We show that popular post-training methods, including Nash learning from human feedback (NLHF), can fundamentally underutilize the benefits of test-time scaling. Even though NLHF is optimal for $k=1$, sampling from the resulting (often deterministic) policy cannot guarantee win rates above $\tfrac{1}{2}$ except for an arbitrarily small slack. This stems from a lack of output diversity: existing alignment methods can collapse to a single majority-preferred response, making additional samples redundant. In contrast, our approach preserves output diversity and achieves the optimal test-time scaling rate. In particular, we propose a family of symmetric multi-player alignment games and prove that any symmetric Nash equilibrium policy of the $(k+1)$-player alignment game achieves the optimal $(k,\frac{k}{k+1})$-robust alignment. Finally, we provide theoretical convergence guarantees for self-play learning dynamics in these games and extend the framework to opponents that also generate multiple responses.
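To make the $(k, f(k))$-robust alignment notion concrete, here is a minimal simulation sketch (ours, not from the paper). It assumes a toy prompt with three possible responses and cyclic, rock-paper-scissors-style user preferences, so that no single response is majority-preferred; it samples $k$ i.i.d. responses from a uniform policy, lets the user keep their favorite, and estimates the win rate against a fixed single-output opponent. In this toy model the win rate is $\tfrac{1}{2}$ at $k=1$ and approaches $1$ as $k$ grows; note that $f(k)=\frac{k}{k+1}$ is the optimal rate guaranteed in the worst case, so a specific instance like this one may converge faster.

```python
import random

# Toy preference model, purely illustrative (not from the paper):
# three responses with cyclic preferences, so no single response is
# majority-preferred -- the setting where output diversity matters.
RESPONSES = ["A", "B", "C"]
BEATS = {"A": "B", "B": "C", "C": "A"}  # A beats B, B beats C, C beats A

def prefers(a: str, b: str) -> float:
    """Win score of a over b: 1 if preferred, 0.5 on a tie, 0 otherwise."""
    if a == b:
        return 0.5
    return 1.0 if BEATS[a] == b else 0.0

def win_rate_k_vs_one(k: int, opponent: str, trials: int = 100_000) -> float:
    """Estimate the win rate of a k-sample uniform policy against a fixed
    single-output opponent: the user picks their favorite among the k
    candidates, and that pick is then compared with the opponent."""
    total = 0.0
    for _ in range(trials):
        candidates = [random.choice(RESPONSES) for _ in range(k)]
        # The user keeps the candidate with the best outcome vs. the opponent.
        best = max(candidates, key=lambda c: prefers(c, opponent))
        total += prefers(best, opponent)
    return total / trials

if __name__ == "__main__":
    for k in [1, 2, 4, 8, 16]:
        rate = win_rate_k_vs_one(k, opponent="A")
        print(f"k={k:2d}  empirical win rate={rate:.3f}  k/(k+1)={k/(k+1):.3f}")
```

The intransitive preference structure is what keeps extra samples useful: a policy collapsed onto any single response would be stuck at win rate $\tfrac{1}{2}$ regardless of $k$, which mirrors the paper's diagnosis of diversity collapse in existing alignment methods.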
Related papers
- Unregularized Linear Convergence in Zero-Sum Game from Preference Feedback [50.89125374999765]
We provide the first convergence guarantee for Optimistic Multiplicative Weights Update ($\mathtt{OMWU}$) in NLHF. Our analysis identifies a novel marginal convergence behavior, where the probability of rarely played actions grows exponentially from exponentially small values.
arXiv Detail & Related papers (2025-12-31T12:08:29Z) - Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias [0.0]
We give a unified, high-probability characterization for the scaling of the family of parameter norms. We then study diagonal linear networks trained by gradient descent.
arXiv Detail & Related papers (2025-09-25T13:59:22Z) - Accelerating Nash Learning from Human Feedback via Mirror Prox [36.04055906691423]
We introduce Nash Mirror Prox (Nash-MP), an online NLHF algorithm that leverages the Mirror Prox optimization scheme to achieve fast and stable convergence to the Nash equilibrium. Our theoretical analysis establishes that Nash-MP exhibits last-iterate linear convergence towards the $\beta$-regularized Nash equilibrium, both for the exploitability gap and uniformly for the span semi-norm of log-probabilities.
arXiv Detail & Related papers (2025-05-26T09:17:32Z) - Provable Scaling Laws for the Test-Time Compute of Large Language Models [84.00141420901038]
We propose two algorithms that enjoy provable scaling laws for the test-time compute of large language models. One is a two-stage knockout-style algorithm, where candidate solutions are compared pairwise and winners advance through a knockout tournament. The other is a two-stage league-style algorithm, where each candidate is evaluated by its average win rate against multiple opponents.
arXiv Detail & Related papers (2024-11-29T05:29:47Z) - Beyond likelihood ratio bias: Nested multi-time-scale stochastic approximation for likelihood-free parameter estimation [49.78792404811239]
We study inference in simulation-based models where the analytical form of the likelihood is unknown. We use a ratio-free nested multi-time-scale stochastic approximation (SA) method that simultaneously tracks the score and drives the parameter update. We show that our algorithm can eliminate the original bias of order $O\big(\sqrt{\tfrac{1}{N}}\big)$ and accelerate convergence beyond the original $O\big(\beta_k+\sqrt{\tfrac{\alpha_k}{N}}\big)$ rate.
arXiv Detail & Related papers (2024-11-20T02:46:15Z) - Transfer Q Star: Principled Decoding for LLM Alignment [105.89114186982972]
Transfer $Q^*$ estimates the optimal value function for a target reward $r$ through a baseline model.
Our approach significantly reduces the sub-optimality gap observed in prior SoTA methods.
arXiv Detail & Related papers (2024-05-30T21:36:12Z) - Revisiting Step-Size Assumptions in Stochastic Approximation [1.3654846342364308]
It is shown for the first time that the standard step-size assumption is not required for convergence or for finer results. Rates of convergence are obtained for the standard algorithm and for estimates obtained via the averaging technique of Polyak and Ruppert. Results from numerical experiments illustrate that $\beta_\theta$ may be large due to a combination of multiplicative noise and Markovian memory.
arXiv Detail & Related papers (2024-05-28T05:11:05Z) - Theoretical guarantees on the best-of-n alignment policy [110.21094183592358]
We show that the commonly used closed-form expression for the KL divergence between the best-of-$n$ policy and the reference policy is an upper bound on the actual KL divergence (a small numerical check of this bound appears after this list). We propose a new estimator for the KL divergence and empirically show that it provides a tight approximation. We conclude by analyzing the tradeoffs between win rate and KL divergence of the best-of-$n$ alignment policy.
arXiv Detail & Related papers (2024-01-03T18:39:13Z) - Adaptive, Doubly Optimal No-Regret Learning in Strongly Monotone and Exp-Concave Games with Gradient Feedback [75.29048190099523]
Online gradient descent (OGD) is well known to be doubly optimal under strong convexity or monotonicity assumptions.
In this paper, we design a fully adaptive OGD algorithm, $\textsf{AdaOGD}$, that does not require a priori knowledge of these parameters.
arXiv Detail & Related papers (2023-10-21T18:38:13Z) - Federated Learning in the Presence of Adversarial Client Unavailability [16.201377650598516]
Federated learning is a decentralized machine learning framework that enables collaborative model training without revealing raw data.
Due to diverse hardware and software limitations, a client may not always be available for the computation requests from the server.
In harsh environments like battlefields, adversaries can selectively silence specific clients.
arXiv Detail & Related papers (2023-05-31T15:57:07Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed confidence setting under a generative model.
The lower bound, stated as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Randomized Smoothing of All Shapes and Sizes [29.40896576138737]
We show that for an appropriate notion of "optimal", the optimal smoothing distribution for any "nice" norm has level sets given by the norm's *Wulff Crystal*.
We show fundamental limits to current randomized smoothing techniques via the theory of *Banach space cotypes*.
arXiv Detail & Related papers (2020-02-19T11:41:09Z)
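As promised in the "Theoretical guarantees on the best-of-n alignment policy" entry above, here is a quick numerical check of the best-of-$n$ KL bound (this sketch is ours, purely illustrative). The commonly used closed-form expression is $\mathrm{KL}(\pi_{\mathrm{bon}} \,\|\, \pi_{\mathrm{ref}}) \approx \log n - \frac{n-1}{n}$, and the entry's claim is that this expression upper-bounds the exact divergence. The script below computes the exact best-of-$n$ distribution for a small discrete reference policy with distinct rewards and compares the exact KL with the formula.

```python
import math

# Tiny discrete example (our own, illustrative): a reference policy over
# four responses whose rewards are strictly increasing in index, so
# best-of-n keeps the sampled response with the highest reward.
probs = [0.4, 0.3, 0.2, 0.1]  # reference policy pi_ref, reward-ascending order

def best_of_n_dist(p, n):
    """Exact best-of-n distribution: item i is selected iff all n draws have
    reward <= reward(i) and at least one draw equals i (distinct rewards)."""
    cum = 0.0
    out = []
    for pi in p:
        lo = cum          # P(reward strictly below item i)
        cum += pi         # P(reward <= item i)
        out.append(cum**n - lo**n)
    return out

def kl(p, q):
    """KL divergence between two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

for n in [1, 2, 4, 8, 16]:
    bon = best_of_n_dist(probs, n)
    exact = kl(bon, probs)
    bound = math.log(n) - (n - 1) / n
    print(f"n={n:2d}  exact KL={exact:.4f}  log n - (n-1)/n = {bound:.4f}")
```

At $n=1$ both quantities are $0$; for larger $n$ the exact KL stays below the closed-form value on this example, consistent with the upper-bound claim in that entry.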