Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
- URL: http://arxiv.org/abs/2407.21787v3
- Date: Mon, 30 Dec 2024 19:03:24 GMT
- Title: Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
- Authors: Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini
- Abstract summary: We explore inference compute as another axis for scaling, using the simple technique of repeatedly sampling candidate solutions from a model.
Across multiple tasks and models, we observe that coverage scales with the number of samples over four orders of magnitude.
In domains like coding and formal proofs, where answers can be automatically verified, these increases in coverage directly translate into improved performance.
- Score: 81.34900892130929
- License:
- Abstract: Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit models to making only one attempt at a problem. Here, we explore inference compute as another axis for scaling, using the simple technique of repeatedly sampling candidate solutions from a model. Across multiple tasks and models, we observe that coverage -- the fraction of problems that are solved by any generated sample -- scales with the number of samples over four orders of magnitude. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. In domains like coding and formal proofs, where answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-Coder-V2-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-sample state-of-the-art of 43%. In domains without automatic verifiers, we find that common methods for picking from a sample collection (majority voting and reward models) plateau beyond several hundred samples and fail to fully scale with the sample budget.
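As a rough illustration of the quantities described in the abstract (not the authors' released code), the sketch below estimates coverage from repeated samples using the standard unbiased pass@k estimator and fits one plausible parameterization of an exponentiated power law, coverage(k) ~ exp(a * k^b) with a < 0; the synthetic per-problem correct counts, the chosen sample budgets, and the fitting procedure are assumptions made here for illustration.

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k draws is correct),
    given c correct answers among n total samples (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

def coverage(correct_counts, n, k):
    """Coverage at k samples: fraction of problems solved by any of k draws,
    averaged over problems."""
    return float(np.mean([pass_at_k(n, c, k) for c in correct_counts]))

# Hypothetical data: for 100 problems, how many of n = 250 samples were correct.
rng = np.random.default_rng(0)
n = 250
correct_counts = rng.binomial(n, rng.beta(0.3, 2.0, size=100))

ks = np.array([1, 4, 16, 64, 250])
cov = np.array([coverage(correct_counts, n, k) for k in ks])

# Exponentiated power-law fit, coverage(k) ~ exp(a * k**b) with a < 0:
# log(-log coverage) is then linear in log k, so a least-squares line suffices.
mask = (cov > 0.0) & (cov < 1.0)
b_hat, intercept = np.polyfit(np.log(ks[mask]), np.log(-np.log(cov[mask])), 1)
a_hat = -np.exp(intercept)
print({int(k): round(float(c), 3) for k, c in zip(ks, cov)}, round(a_hat, 4), round(b_hat, 4))
```

When coverage really is log-linear in this sense, the fitted pair (a, b) summarizes how quickly coverage improves as the sample budget grows.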
Related papers
- Single-Step Consistent Diffusion Samplers [8.758218443992467]
Existing sampling algorithms typically require many iterative steps to produce high-quality samples.
We introduce consistent diffusion samplers, a new class of samplers designed to generate high-fidelity samples in a single step.
We show that our approach yields high-fidelity samples using less than 1% of the network evaluations required by traditional diffusion samplers.
arXiv Detail & Related papers (2025-02-11T14:25:52Z)
- Differentially Private Multi-Sampling from Distributions [4.292685318253575]
We study the sample complexity of DP single-sampling, i.e., the minimum number of samples needed to perform this task.
We define two variants of multi-sampling, where the goal is to privately approximate $m>1$ samples.
arXiv Detail & Related papers (2024-12-13T19:14:05Z)
- Model Equality Testing: Which Model Is This API Serving? [59.005869726179455]
We formalize detecting such distortions as Model Equality Testing, a two-sample testing problem.
A test built on a simple string kernel achieves a median of 77.4% power against a range of distortions.
We then apply this test to commercial inference APIs for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta.
arXiv Detail & Related papers (2024-10-26T18:34:53Z)
- Keep Guessing? When Considering Inference Scaling, Mind the Baselines [45.21178011740911]
Scaling inference compute in large language models consistently increases the coverage (fraction of problems solved) as the number of samples increases.
We conjecture that this observed improvement is partially due to the answer distribution of standard evaluation benchmarks, which is skewed towards a relatively small set of common answers.
arXiv Detail & Related papers (2024-10-20T18:43:05Z)
- Testing properties of distributions in the streaming model [0.0]
We study distribution testing in the standard access model and the conditional access model.
The goal is to test properties of a distribution using an optimal number of samples subject to a memory constraint.
arXiv Detail & Related papers (2023-09-06T10:53:29Z)
- Learning Large Scale Sparse Models [6.428186644949941]
We consider learning sparse models in large scale settings, where the number of samples and the feature dimension can grow as large as millions or billions.
We propose to learn sparse models such as Lasso in an online manner where, in each iteration, only one randomly chosen sample is revealed and used to compute a sparse gradient update (a minimal sketch of this scheme appears after this list).
As a result, the memory cost is independent of the sample size, and the gradient evaluation for a single sample is efficient.
arXiv Detail & Related papers (2023-01-26T06:29:49Z) - Saliency Grafting: Innocuous Attribution-Guided Mixup with Calibrated
Label Mixing [104.630875328668]
The Mixup scheme suggests mixing a pair of samples to create an augmented training sample.
We present a novel, yet simple Mixup-variant that captures the best of both worlds.
arXiv Detail & Related papers (2021-12-16T11:27:48Z) - Sampling from Arbitrary Functions via PSD Models [55.41644538483948]
We take a two-step approach by first modeling the probability distribution and then sampling from that model.
We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models.
arXiv Detail & Related papers (2021-10-20T12:25:22Z) - Anytime Sampling for Autoregressive Models via Ordered Autoencoding [88.01906682843618]
Autoregressive models are widely used for tasks such as image and audio generation.
The sampling process of these models does not allow interruptions and cannot adapt to real-time computational resources.
We propose a new family of autoregressive models that enables anytime sampling.
arXiv Detail & Related papers (2021-02-23T05:13:16Z)
- One for More: Selecting Generalizable Samples for Generalizable ReID Model [92.40951770273972]
This paper proposes a one-for-more training objective that takes the generalization ability of selected samples as a loss function.
Our proposed one-for-more based sampler can be seamlessly integrated into the ReID training framework.
arXiv Detail & Related papers (2020-12-10T06:37:09Z)
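The sketch below, referenced from the Learning Large Scale Sparse Models entry above, illustrates the kind of online update that entry describes: each step reveals a single (x, y) pair and applies a proximal (soft-thresholding) stochastic-gradient step for the Lasso objective, so memory stays proportional to the feature dimension regardless of the number of samples. The step size, regularization strength, and synthetic data stream are illustrative assumptions, not details taken from that paper.

```python
import numpy as np

def soft_threshold(w, tau):
    """Proximal operator of the L1 norm: shrinks coefficients and sets small ones to exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def online_lasso(sample_stream, dim, lam=0.1, lr=0.01):
    """Proximal stochastic gradient for Lasso.

    Each step sees a single (x, y) pair, so memory is O(dim) and does not
    grow with the number of samples in the stream."""
    w = np.zeros(dim)
    for x, y in sample_stream:
        grad = (w @ x - y) * x                      # gradient of 0.5 * (w.x - y)^2 for one sample
        w = soft_threshold(w - lr * grad, lr * lam)  # gradient step, then L1 proximal step
    return w

# Illustrative synthetic stream: sparse ground truth, one sample revealed per step.
rng = np.random.default_rng(0)
dim = 1000
true_w = np.zeros(dim)
true_w[:10] = 1.0
stream = ((x, x @ true_w + 0.01 * rng.standard_normal())
          for x in (rng.standard_normal(dim) for _ in range(5000)))
w_hat = online_lasso(stream, dim)
print("nonzero coefficients:", int(np.count_nonzero(w_hat)))
```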