Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models
- URL: http://arxiv.org/abs/2411.19477v2
- Date: Fri, 07 Feb 2025 07:08:29 GMT
- Title: Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models
- Authors: Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou,
- Abstract summary: We propose two principled algorithms for the test-time compute of large language models.
We prove theoretically that the failure probability of one algorithm decays to zero exponentially as its test-time compute grows.
- Score: 70.07661254213181
- License:
- Abstract: We propose two simple yet principled algorithms that enjoy provable scaling laws for the test-time compute of large language models (LLMs), which require a black-box LLM and nothing else (e.g., no external verifier or reward model) for a minimalistic implementation. (i) The first one is a two-stage knockout-style algorithm: given an input problem, it first generates multiple candidate solutions, and then aggregate them for a final output, via a knockout tournament where pairwise comparisons among the candidates are conducted. Assuming that the LLM can generate a correct solution with non-zero probability and do better than a random guess in comparing a pair of correct and incorrect solutions, we prove theoretically that the failure probability of this algorithm decays to zero exponentially or by a power law (depending on the specific way of scaling) as its test-time compute grows. (ii) The second one is a two-stage league-style algorithm, where each candidate solution is evaluated by its average win rate against multiple opponents, rather than eliminated upon loss to a single opponent. Under certain technical assumptions that are analogous to but more robust than those required by the knockout-style algorithm, we prove theoretically that the failure probability of the league-style algorithm also decays to zero exponentially as its test-time compute grows. Through extensive experiments with two challenging benchmarks, namely GPQA and MMLU-Pro, we validate the proposed theories and demonstrate the outstanding scaling properties of both algorithms.
Related papers
- Differentially-Private Hierarchical Clustering with Provable
Approximation Guarantees [79.59010418610625]
We study differentially private approximation algorithms for hierarchical clustering.
We show strong lower bounds for the problem: that any $epsilon$-DP algorithm must exhibit $O(|V|2/ epsilon)$-additive error for an input dataset.
We propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly.
arXiv Detail & Related papers (2023-01-31T19:14:30Z) - Choosing the Right Algorithm With Hints From Complexity Theory [16.33500498939925]
We show that the Metropolis algorithm is clearly the best of all algorithms regarded for reasonable problem sizes.
An artificial algorithm of this type having an $O(n log n)$ runtime leads to the result that the significance-based compact genetic algorithm (sig-cGA) can solve the DLB problem in time $O(n log n)$ with high probability.
arXiv Detail & Related papers (2021-09-14T11:12:32Z) - Machine Learning for Online Algorithm Selection under Censored Feedback [71.6879432974126]
In online algorithm selection (OAS), instances of an algorithmic problem class are presented to an agent one after another, and the agent has to quickly select a presumably best algorithm from a fixed set of candidate algorithms.
For decision problems such as satisfiability (SAT), quality typically refers to the algorithm's runtime.
In this work, we revisit multi-armed bandit algorithms for OAS and discuss their capability of dealing with the problem.
We adapt them towards runtime-oriented losses, allowing for partially censored data while keeping a space- and time-complexity independent of the time horizon.
arXiv Detail & Related papers (2021-09-13T18:10:52Z) - Faster Matchings via Learned Duals [31.61057940283721]
We combine the idea of machine-learned predictions with the idea of "starting-warm" primal-dual algorithms.
First, predicted duals may be infeasible, so we give an algorithm that efficiently maps predicted infeasible duals to nearby feasible solutions.
Second, once the duals are feasible, they may not be optimal, so we show that they can be used to quickly find an optimal solution.
arXiv Detail & Related papers (2021-07-20T21:11:09Z) - Lower Bounds and Optimal Algorithms for Smooth and Strongly Convex
Decentralized Optimization Over Time-Varying Networks [79.16773494166644]
We consider the task of minimizing the sum of smooth and strongly convex functions stored in a decentralized manner across the nodes of a communication network.
We design two optimal algorithms that attain these lower bounds.
We corroborate the theoretical efficiency of these algorithms by performing an experimental comparison with existing state-of-the-art methods.
arXiv Detail & Related papers (2021-06-08T15:54:44Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in a form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z) - Sparse Hashing for Scalable Approximate Model Counting: Theory and
Practice [36.8421113576893]
Given a CNF formula F on n variables, the problem of model counting or #SAT is to compute the number of satisfying assignments of F.
Recent years have witnessed a surge of effort towards developing efficient algorithmic techniques.
arXiv Detail & Related papers (2020-04-30T11:17:26Z) - Learning Sparse Classifiers: Continuous and Mixed Integer Optimization
Perspectives [10.291482850329892]
Mixed integer programming (MIP) can be used to solve (to optimality) $ell_0$-regularized regression problems.
We propose two classes of scalable algorithms: an exact algorithm that can handlepapprox 50,000$ features in a few minutes, and approximate algorithms that can address instances with $papprox6$.
In addition, we present new estimation error bounds for $ell$-regularizeds.
arXiv Detail & Related papers (2020-01-17T18:47:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.