Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
- URL: http://arxiv.org/abs/2407.21787v2
- Date: Mon, 16 Sep 2024 17:58:42 GMT
- Title: Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
- Authors: Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini
- Abstract summary: We explore inference compute as another axis for scaling by increasing the number of generated samples.
In domains like coding and formal proofs, where all answers can be automatically verified, these increases in coverage directly translate into improved performance.
We find that identifying correct samples out of many generations remains an important direction for future research in domains without automatic verifiers.
- Score: 81.34900892130929
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit the amount of compute to only one attempt per problem. Here, we explore inference compute as another axis for scaling by increasing the number of generated samples. Across multiple tasks and models, we observe that coverage - the fraction of problems solved by any attempt - scales with the number of samples over four orders of magnitude. In domains like coding and formal proofs, where all answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-V2-Coder-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-attempt state-of-the-art of 43% which uses more capable frontier models. Moreover, using current API pricing, amplifying the cheaper DeepSeek model with five samples is more cost-effective and solves more issues than paying a premium for one sample from GPT-4o or Claude 3.5 Sonnet. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. Finally, we find that identifying correct samples out of many generations remains an important direction for future research in domains without automatic verifiers. When solving math word problems from GSM8K and MATH, coverage with Llama-3 models grows to over 95% with 10,000 samples. However, common methods to pick correct solutions from a sample collection, such as majority voting or reward models, plateau beyond several hundred samples and fail to fully scale with the sample budget.
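To make the coverage and scaling-law claims concrete, the sketch below is a minimal Python illustration (not the authors' released code): it estimates coverage at a sample budget k with the standard unbiased pass@k estimator and fits an exponentiated power law of the form coverage(k) ≈ exp(a·k^b). The per-problem correctness counts are synthetic placeholders, and the paper's exact fitting procedure may differ.
```python
# Minimal sketch with synthetic data: coverage via the unbiased pass@k estimator
# and an exponentiated power-law fit coverage(k) ~ exp(a * k^b).
import numpy as np
from math import comb
from scipy.optimize import curve_fit

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct),
    given n total samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def coverage(correct_counts, n: int, k: int) -> float:
    """Coverage at budget k: mean pass@k over all problems."""
    return float(np.mean([pass_at_k(n, c, k) for c in correct_counts]))

# Hypothetical dataset: 200 problems, 1000 samples each; c = correct samples per problem.
rng = np.random.default_rng(0)
n_samples = 1000
correct_counts = rng.binomial(n_samples, rng.beta(0.3, 3.0, size=200))

ks = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1000], dtype=float)
cov = np.array([coverage(correct_counts, n_samples, int(k)) for k in ks])

# One way to parameterize the "exponentiated power law": coverage(k) = exp(a * k**b),
# with a < 0 (so exp(a) is roughly pass@1) and b < 0 (coverage rises toward 1 as k grows).
def exp_power_law(k, a, b):
    return np.exp(a * np.power(k, b))

(a_hat, b_hat), _ = curve_fit(exp_power_law, ks, cov, p0=(-1.0, -0.5), maxfev=10000)
print(f"fit: coverage(k) ~= exp({a_hat:.3f} * k^{b_hat:.3f})")
```
Under this parameterization, log(-log coverage) is linear in log k, which is one convenient way to check the fit visually against the observed coverage curve.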
Related papers
- Quasi-random Multi-Sample Inference for Large Language Models [1.647759094903376]
Large language models (LLMs) are often equipped with multi-sample decoding strategies.
Traditional text generation methods, such as beam search and sampling-based techniques, have notable limitations.
This study explores the potential of arithmetic sampling, contrasting it with ancestral sampling.
arXiv Detail & Related papers (2024-11-09T18:55:04Z)
- Model Equality Testing: Which Model Is This API Serving? [59.005869726179455]
We formalize detecting such distortions as Model Equality Testing, a two-sample testing problem.
A test built on a simple string kernel achieves a median of 77.4% power against a range of distortions.
We then apply this test to commercial inference APIs for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta.
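As a rough illustration of the underlying idea (and not the paper's exact kernel or test statistic), the following sketch runs a kernel two-sample (MMD) test with a simple character n-gram string kernel and a permutation p-value; the completion strings are hypothetical placeholders.
```python
# Minimal sketch: MMD two-sample test over strings with a permutation p-value.
import numpy as np

def ngram_kernel(s: str, t: str, n: int = 3) -> float:
    """Similarity = Jaccard overlap of character n-gram sets."""
    A = {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}
    B = {t[i:i + n] for i in range(max(len(t) - n + 1, 1))}
    return len(A & B) / max(len(A | B), 1)

def mmd2(X, Y, kernel) -> float:
    """Biased estimate of squared MMD between samples X and Y."""
    kxx = np.mean([[kernel(a, b) for b in X] for a in X])
    kyy = np.mean([[kernel(a, b) for b in Y] for a in Y])
    kxy = np.mean([[kernel(a, b) for b in Y] for a in X])
    return kxx + kyy - 2 * kxy

def permutation_pvalue(X, Y, kernel, n_perm: int = 200, seed: int = 0) -> float:
    """P-value for H0: X and Y come from the same distribution."""
    rng = np.random.default_rng(seed)
    observed = mmd2(X, Y, kernel)
    pooled = list(X) + list(Y)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        Xp = [pooled[i] for i in perm[: len(X)]]
        Yp = [pooled[i] for i in perm[len(X):]]
        if mmd2(Xp, Yp, kernel) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)

# Hypothetical completions from a reference model vs. an API endpoint.
reference = ["The capital of France is Paris."] * 5 + ["Paris is the capital of France."] * 5
endpoint = ["The capital of France is Paris!"] * 5 + ["It's Paris."] * 5
print(permutation_pvalue(reference, endpoint, ngram_kernel))
```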
arXiv Detail & Related papers (2024-10-26T18:34:53Z)
- Keep Guessing? When Considering Inference Scaling, Mind the Baselines [45.21178011740911]
Scaling inference compute in large language models consistently increases coverage (the fraction of problems solved) as the number of samples grows.
We conjecture that this observed improvement is partially due to the answer distribution of standard evaluation benchmarks, which is skewed towards a relatively small set of common answers.
arXiv Detail & Related papers (2024-10-20T18:43:05Z)
- Controllable Generation via Locally Constrained Resampling [77.48624621592523]
We propose a tractable probabilistic approach that performs Bayesian conditioning to draw samples subject to a constraint.
Our approach considers the entire sequence, leading to a more globally optimal constrained generation than current greedy methods.
We show that our approach is able to steer the model's outputs away from toxic generations, outperforming similar approaches to detoxification.
arXiv Detail & Related papers (2024-10-17T00:49:53Z)
- How much can we forget about Data Contamination? [15.893161447368273]
Leakage of benchmark data into the training data has emerged as a significant challenge for large language models.
We use experimental evidence and theoretical estimates to challenge the common assumption that small-scale contamination renders benchmark evaluations invalid.
arXiv Detail & Related papers (2024-10-04T09:14:11Z)
- Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs [60.58434523646137]
A popular approach for improving the correctness of output from large language models (LLMs) is Self-Consistency.
We introduce Adaptive-Consistency, a cost-efficient, model-agnostic technique that dynamically adjusts the number of samples per question.
Our experiments show that Adaptive-Consistency reduces the sample budget by up to 7.9 times with an average accuracy drop of less than 0.1%.
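To illustrate the general shape of such a scheme (a simplified majority-confidence stopping rule, not necessarily the paper's exact criterion), here is a small sketch in which `sample_answer` is a hypothetical stand-in for one stochastic LLM call.
```python
# Minimal sketch: draw answers one at a time and stop once the majority looks stable.
import random
from collections import Counter

def sample_answer(question: str) -> str:
    # Placeholder for one stochastic LLM sample; returns a final answer string.
    return random.choice(["42", "42", "42", "41"])

def adaptive_majority(question: str, max_samples: int = 40,
                      confidence: float = 0.95) -> str:
    """Stop sampling when a crude lower confidence bound on the majority
    answer's share clears 0.5, or when the budget is exhausted."""
    counts = Counter()
    for n in range(1, max_samples + 1):
        counts[sample_answer(question)] += 1
        top_answer, top_count = counts.most_common(1)[0]
        p_hat = top_count / n
        # Normal-approximation lower bound on the majority share.
        z = 1.645 if confidence <= 0.95 else 2.326
        lower = p_hat - z * (p_hat * (1 - p_hat) / n) ** 0.5
        if n >= 5 and lower > 0.5:
            return top_answer
    return counts.most_common(1)[0][0]

print(adaptive_majority("What is 6 * 7?"))
```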
arXiv Detail & Related papers (2023-05-19T17:49:25Z)
- Learning Large Scale Sparse Models [6.428186644949941]
We consider learning sparse models in large scale settings, where the number of samples and the feature dimension can grow as large as millions or billions.
We propose to learn sparse models such as the Lasso in an online manner, where in each iteration only one randomly chosen sample is revealed and used for the update.
As a result, the memory cost is independent of the sample size, and evaluating the gradient for a single sample is efficient.
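A generic version of this idea (a stochastic proximal-gradient Lasso with soft-thresholding, offered as an illustrative sketch rather than the paper's specific algorithm) looks like the following; the regression data are synthetic placeholders.
```python
# Minimal sketch: online Lasso via stochastic proximal gradient, one sample per step.
import numpy as np

def soft_threshold(w: np.ndarray, tau: float) -> np.ndarray:
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def online_lasso(X: np.ndarray, y: np.ndarray, lam: float = 0.1,
                 lr: float = 0.01, n_steps: int = 20000, seed: int = 0) -> np.ndarray:
    """Minimize (1/2) * E[(x.w - y)^2] + lam * ||w||_1 with one-sample updates,
    so memory depends only on the feature dimension."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(n_steps):
        i = rng.integers(n)                  # one randomly chosen sample
        grad = (X[i] @ w - y[i]) * X[i]      # gradient of the squared loss on sample i
        step = lr / np.sqrt(t + 1)           # decaying step size
        w = soft_threshold(w - step * grad, step * lam)  # proximal (soft-threshold) step
    return w

# Synthetic sparse regression problem (hypothetical data).
rng = np.random.default_rng(1)
w_true = np.zeros(50)
w_true[:3] = [2.0, -1.5, 1.0]
X = rng.normal(size=(1000, 50))
y = X @ w_true + 0.1 * rng.normal(size=1000)
w_hat = online_lasso(X, y)
print("nonzero coordinates recovered:", np.flatnonzero(np.abs(w_hat) > 0.1))
```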
arXiv Detail & Related papers (2023-01-26T06:29:49Z)
- Saliency Grafting: Innocuous Attribution-Guided Mixup with Calibrated Label Mixing [104.630875328668]
The Mixup scheme suggests mixing a pair of samples to create an augmented training sample.
We present a novel yet simple Mixup variant that captures the best of both worlds.
arXiv Detail & Related papers (2021-12-16T11:27:48Z)
- Error Detection in Large-Scale Natural Language Understanding Systems Using Transformer Models [0.0]
Large-scale conversational assistants like Alexa, Siri, Cortana and Google Assistant process every utterance using multiple models for domain, intent and named entity recognition.
We address the challenge of detecting domain classification errors using offline Transformer models.
We combine utterance encodings from a RoBERTa model with the N-best hypotheses produced by the production system, then fine-tune end-to-end in a multitask setting using a small dataset of human-annotated utterances with domain classification errors.
arXiv Detail & Related papers (2021-09-04T00:10:48Z)
- Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)