Related papers: Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models

Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models

URL: http://arxiv.org/abs/2411.19477v2
Date: Fri, 07 Feb 2025 07:08:29 GMT
Title: Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models
Authors: Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou
Abstract summary: We propose two principled algorithms for the test-time compute of large language models. We prove theoretically that the failure probability of one algorithm decays to zero exponentially as its test-time compute grows.
Score: 70.07661254213181
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose two simple yet principled algorithms that enjoy provable scaling laws for the test-time compute of large language models (LLMs), which require a black-box LLM and nothing else (e.g., no external verifier or reward model) for a minimalistic implementation. (i) The first one is a two-stage knockout-style algorithm: given an input problem, it first generates multiple candidate solutions, and then aggregates them for a final output, via a knockout tournament where pairwise comparisons among the candidates are conducted. Assuming that the LLM can generate a correct solution with non-zero probability and do better than a random guess in comparing a pair of correct and incorrect solutions, we prove theoretically that the failure probability of this algorithm decays to zero exponentially or by a power law (depending on the specific way of scaling) as its test-time compute grows. (ii) The second one is a two-stage league-style algorithm, where each candidate solution is evaluated by its average win rate against multiple opponents, rather than eliminated upon loss to a single opponent. Under certain technical assumptions that are analogous to but more robust than those required by the knockout-style algorithm, we prove theoretically that the failure probability of the league-style algorithm also decays to zero exponentially as its test-time compute grows. Through extensive experiments with two challenging benchmarks, namely GPQA and MMLU-Pro, we validate the proposed theories and demonstrate the outstanding scaling properties of both algorithms.
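
The two aggregation schemes described in the abstract can be illustrated with a toy simulation. This is a minimal sketch and not the authors' implementation: the LLM is replaced by a random model in which each candidate is simply a boolean (True for a correct solution), and every name below (make_compare, knockout, league, failure_rate, p_gen, p_comp, k) is a hypothetical choice made for illustration. The league variant here plays each candidate against all others, which is a simplification of evaluating it "against multiple opponents".

```python
import random
from typing import Callable, List

# A judge returns True if the first candidate wins the comparison.
Compare = Callable[[bool, bool], bool]

def make_compare(p_comp: float, rng: random.Random) -> Compare:
    """Simulated noisy judge (assumption): picks the correct one w.p. p_comp."""
    def compare(a: bool, b: bool) -> bool:
        if a == b:                       # equally good or bad: coin flip
            return rng.random() < 0.5
        correct_wins = rng.random() < p_comp
        return correct_wins if a else not correct_wins
    return compare

def knockout(candidates: List[bool], compare: Compare, k: int = 1) -> bool:
    """Single elimination: pair candidates up, keep the majority winner of k votes."""
    pool = list(candidates)              # assumes at least one candidate
    while len(pool) > 1:
        winners = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            wins_a = sum(compare(a, b) for _ in range(k))
            winners.append(a if 2 * wins_a > k else b)
        if len(pool) % 2 == 1:           # odd candidate out advances with a bye
            winners.append(pool[-1])
        pool = winners
    return pool[0]

def league(candidates: List[bool], compare: Compare, k: int = 1) -> bool:
    """Round robin: return the candidate with the highest average win rate."""
    n = len(candidates)
    if n == 1:
        return candidates[0]
    def win_rate(i: int) -> float:
        games = [compare(candidates[i], candidates[j])
                 for j in range(n) if j != i for _ in range(k)]
        return sum(games) / len(games)
    return candidates[max(range(n), key=win_rate)]

def failure_rate(algo, n: int, p_gen: float = 0.3, p_comp: float = 0.7,
                 trials: int = 2000, seed: int = 0) -> float:
    """Fraction of simulated problems where the final output is incorrect."""
    rng = random.Random(seed)
    compare = make_compare(p_comp, rng)
    fails = 0
    for _ in range(trials):
        # Stage 1: generate n candidates, each correct with probability p_gen.
        candidates = [rng.random() < p_gen for _ in range(n)]
        # Stage 2: aggregate them into one final output.
        fails += not algo(candidates, compare)
    return fails / trials

for n in (1, 4, 16):
    print(n, failure_rate(knockout, n), failure_rate(league, n))
```

The parameters p_gen > 0 and p_comp > 0.5 stand in for the abstract's two assumptions (a non-zero chance of generating a correct solution, and better-than-random comparison of a correct and an incorrect solution). Increasing n or k is one way to spend more test-time compute in this toy model; it does not reproduce the paper's experiments.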

Related papers

Solving quadratic binary optimization problems using quantum SDP methods: Non-asymptotic running time analysis [1.9081120388919084]
Quantum computers can solve semidefinite programs (SDPs) using resources that scale better than state-of-the-art classical methods. We present an analysis of the non-asymptotic resource requirements of a quantum SDP solver.
arXiv Detail & Related papers (2025-02-21T12:54:05Z)
Sample-Optimal Locally Private Hypothesis Selection and the Provable Benefits of Interactivity [8.100854060749212]
We study the problem of hypothesis selection under the constraint of local differential privacy. We devise an $\varepsilon$-locally-differentially-private ($\varepsilon$-LDP) algorithm that uses $\Theta\left(\frac{k \log k}{\alpha^2 \min\{\varepsilon^2, 1\}}\right)$ samples to guarantee that $d_{TV}(h,\hat{f}) \leq \alpha + 9 \min_{f \in \mathcal{F}} \dots$
arXiv Detail & Related papers (2023-12-09T19:22:10Z)
An Oblivious Stochastic Composite Optimization Algorithm for Eigenvalue Optimization Problems [76.2042837251496]
We introduce two oblivious mirror descent algorithms based on a complementary composite setting. Remarkably, both algorithms work without prior knowledge of the Lipschitz constant or smoothness of the objective function. We show how to extend our framework to scale and demonstrate the efficiency and robustness of our methods on large scale semidefinite programs.
arXiv Detail & Related papers (2023-06-30T08:34:29Z)
Differentially-Private Hierarchical Clustering with Provable Approximation Guarantees [79.59010418610625]
We study differentially private approximation algorithms for hierarchical clustering. We show strong lower bounds for the problem: that any $\epsilon$-DP algorithm must exhibit $O(|V|^2/\epsilon)$-additive error for an input dataset. We propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly.
arXiv Detail & Related papers (2023-01-31T19:14:30Z)
Private estimation algorithms for stochastic block models and mixture models [63.07482515700984]
General tools for designing efficient private estimation algorithms. First efficient $(\epsilon, \delta)$-differentially private algorithm for both weak recovery and exact recovery.
arXiv Detail & Related papers (2023-01-11T09:12:28Z)
Mind the gap: Achieving a super-Grover quantum speedup by jumping to the end [114.3957763744719]
We present a quantum algorithm that has rigorous runtime guarantees for several families of binary optimization problems. We show that the algorithm finds the optimal solution in time $O^*(2^{(0.5-c)n})$ for an $n$-independent constant $c$. We also show that for a large fraction of random instances from the $k$-spin model and for any fully satisfiable or slightly frustrated $k$-CSP formula, statement (a) is the case.
arXiv Detail & Related papers (2022-12-03T02:45:23Z)
Clustering with Queries under Semi-Random Noise [13.817228853960655]
We develop robust learning methods that tolerate general semi-random noise. We show that information theoretically $O\left(\frac{nk \log n}{(1-2p)^2}\right)$ queries suffice to learn any cluster of sufficiently large size.
arXiv Detail & Related papers (2022-06-09T16:02:00Z)
Choosing the Right Algorithm With Hints From Complexity Theory [16.33500498939925]
We show that the Metropolis algorithm is clearly the best of all algorithms regarded for reasonable problem sizes. An artificial algorithm of this type having an $O(n \log n)$ runtime leads to the result that the significance-based compact genetic algorithm (sig-cGA) can solve the DLB problem in time $O(n \log n)$ with high probability.
arXiv Detail & Related papers (2021-09-14T11:12:32Z)
Machine Learning for Online Algorithm Selection under Censored Feedback [71.6879432974126]
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms. For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime. In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem. We adapt them towards runtime-oriented losses, allowing for partially censored data while keeping a space- and time-complexity independent of the time horizon.
arXiv Detail & Related papers (2021-09-13T18:10:52Z)
Faster Matchings via Learned Duals [31.61057940283721]
We combine the idea of machine-learned predictions with the idea of "starting-warm" primal-dual algorithms. First, predicted duals may be infeasible, so we give an algorithm that efficiently maps predicted infeasible duals to nearby feasible solutions. Second, once the duals are feasible, they may not be optimal, so we show that they can be used to quickly find an optimal solution.
arXiv Detail & Related papers (2021-07-20T21:11:09Z)
Lower Bounds and Optimal Algorithms for Smooth and Strongly Convex Decentralized Optimization Over Time-Varying Networks [79.16773494166644]
We consider the task of minimizing the sum of smooth and strongly convex functions stored in a decentralized manner across the nodes of a communication network. We design two optimal algorithms that attain these lower bounds. We corroborate the theoretical efficiency of these algorithms by performing an experimental comparison with existing state-of-the-art methods.
arXiv Detail & Related papers (2021-06-08T15:54:44Z)
Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model. The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step. Our results are expressed in a form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
Sparse Hashing for Scalable Approximate Model Counting: Theory and Practice [36.8421113576893]
Given a CNF formula F on n variables, the problem of model counting or #SAT is to compute the number of satisfying assignments of F. Recent years have witnessed a surge of effort towards developing efficient algorithmic techniques.
arXiv Detail & Related papers (2020-04-30T11:17:26Z)
A New Minimax Theorem for Randomized Algorithms [1.2284934135116514]
We introduce a new type of minimax theorem which can provide a hard distribution $\mu$ that works for all bias levels at once. We show that this works for randomized query complexity, randomized communication complexity, approximate degree, and approximate logrank. We also prove an improved version of Impagliazzo's hardcore lemma.
arXiv Detail & Related papers (2020-02-25T11:46:08Z)
Locally Private Hypothesis Selection [96.06118559817057]
We output a distribution from $\mathcal{Q}$ whose total variation distance to $p$ is comparable to the best such distribution. We show that the constraint of local differential privacy incurs an exponential increase in cost. Our algorithms result in exponential improvements on the round complexity of previous methods.
arXiv Detail & Related papers (2020-02-21T18:30:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.