Related papers: Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation

URL: http://arxiv.org/abs/2410.02725v1
Date: Thu, 3 Oct 2024 17:47:29 GMT
Title: Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation
Authors: Rohin Manvi, Anikait Singh, Stefano Ermon
Abstract summary: We introduce a new generative self-evaluation scheme designed to adaptively reduce the number of generated samples. We demonstrate that 74% of the improvement from using 16 samples can be achieved with only 1.2 samples on average.
Score: 51.127054971591924
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Inference-time computation is a powerful paradigm to enhance the performance of large language models (LLMs), with Best-of-N sampling being a widely used technique. However, this method is computationally expensive, requiring both (1) an external reward model and (2) the generation of multiple samples. In this work, we introduce a new generative self-evaluation scheme designed to adaptively reduce the number of generated samples while maintaining or even improving performance. We use a generative reward model formulation, allowing the LLM to predict mid-generation the probability that restarting the generation will yield a better response. These predictions are obtained without an external reward model and can be used to decide whether or not to generate more samples, prune unpromising samples early on, or to pick the best sample. This capability is very inexpensive as it involves generating a single predefined token. Trained using a dataset constructed with real unfiltered LMSYS user prompts, Llama 3.1 8B's win rate against GPT-4 on AlpacaEval increases from 21% to 34% with 16 samples and math performance on GSM8K improves from 84% to 91%. By sampling only when the LLM determines that it is beneficial to do so and adaptively adjusting temperature annealing, we demonstrate that 74% of the improvement from using 16 samples can be achieved with only 1.2 samples on average. We further demonstrate that 50-75% of samples can be pruned early in generation with minimal degradation in performance. Overall, our methods enable more efficient and scalable compute utilization during inference for LLMs.
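The adaptive sampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `adaptive_best_of_n`, the callables `generate` and `p_restart_better`, and the `stop_threshold` parameter are all hypothetical stand-ins. In particular, `p_restart_better` stands in for the model's self-evaluation (the predicted probability that restarting would yield a better response). Early pruning of partial generations and temperature annealing are omitted.

```python
from typing import Callable, Optional, Tuple


def adaptive_best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    p_restart_better: Callable[[str, str], float],
    max_samples: int = 16,
    stop_threshold: float = 0.2,
) -> Tuple[str, int]:
    """Sample until the self-evaluation says another sample is unlikely to help.

    All names here are assumptions for illustration:
    - generate(prompt) returns one sampled response.
    - p_restart_better(prompt, response) returns a value in [0, 1], the
      predicted probability that restarting would give a better response.
    Returns the chosen response and the number of samples actually drawn.
    """
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")

    best_response: Optional[str] = None
    best_p = float("inf")
    used = 0

    for used in range(1, max_samples + 1):
        response = generate(prompt)
        p = p_restart_better(prompt, response)

        # Keep the response for which a restart looks least promising.
        if p < best_p:
            best_response, best_p = response, p

        # Stop early once the best response so far looks good enough.
        if best_p <= stop_threshold:
            break

    assert best_response is not None  # guaranteed because max_samples >= 1
    return best_response, used
```

With a fixed budget of 16 samples, plain Best-of-N always pays for all 16 generations. A loop like the one above pays only for as many as the stopping rule requires, which is the kind of saving the abstract reports as 1.2 samples on average.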

Related papers

C3PO: Critical-Layer, Core-Expert, Collaborative Pathway Optimization for Test-Time Expert Re-Mixing [21.119495676190127]
Mixture-of-Experts (MoE) Large Language Models (LLMs) suffer from severely sub-optimal expert pathways. Naive expert selection learned from pretraining leaves a surprising 10-20% accuracy gap for improvement. We develop a novel class of test-time optimization methods to jointly re-weight, or "re-mix," the experts in different layers for each test sample.
arXiv Detail & Related papers (2025-04-10T17:59:56Z)
Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization [66.67988187816185]
We aim to scale up the number of on-policy samples via repeated random sampling to improve alignment performance. Our experiments reveal that this strategy leads to a decline in performance as the sample size increases. We introduce a scalable preference data construction strategy that consistently enhances model performance as the sample scale increases.
arXiv Detail & Related papers (2025-02-24T04:22:57Z)
Sampling in CMA-ES: Low Numbers of Low Discrepancy Points [0.0]
We show that iterating through small, fixed sets of low-discrepancy points can still perform better than the default uniform distribution. For lower dimensionalities, we find that using as few as 32 unique low-discrepancy points performs similarly to or better than uniform sampling.
arXiv Detail & Related papers (2024-09-24T10:04:55Z)
ITERTL: An Iterative Framework for Fine-tuning LLMs for RTL Code Generation [9.409062607311528]
Large language models (LLMs) have demonstrated excellent performance in understanding human instructions and generating code. We introduce a simple yet effective iterative training paradigm named ITERTL. We show the model trained through our proposed approach can compete with and even outperform the state-of-the-art (SOTA) open-source model.
arXiv Detail & Related papers (2024-06-28T01:44:57Z)
Priority Sampling of Large Language Models for Compilers [4.2266182821287135]
Priority Sampling is a simple and deterministic sampling technique that produces unique samples ordered by the model's confidence. It supports generation based on a regular expression, which provides a controllable and structured exploration process. It outperforms the autotuner used for the generation of labels for the training of the original model in just 30 samples.
arXiv Detail & Related papers (2024-02-28T22:27:49Z)
Enhancing Sample Utilization through Sample Adaptive Augmentation in Semi-Supervised Learning [47.677929366323596]
In semi-supervised learning (SSL), unlabeled samples can be utilized through augmentation and consistency regularization. Existing SSL models overlook the characteristics of naive samples and simply apply the same learning strategy to all samples. We propose sample adaptive augmentation (SAA) to give attention to naive samples and augment them in a more diverse manner.
arXiv Detail & Related papers (2023-09-07T09:50:45Z)
Entropy-based Training Methods for Scalable Neural Implicit Sampler [15.978655106034113]
Efficiently sampling from un-normalized target distributions is a fundamental problem in scientific computing and machine learning. In this paper, we propose an efficient and scalable neural implicit sampler that overcomes these limitations. Our sampler can generate large batches of samples with low computational costs by leveraging a neural transformation that directly maps easily sampled latent vectors to target samples.
arXiv Detail & Related papers (2023-06-08T05:56:05Z)
Let's Sample Step by Step: Adaptive-Consistency for Efficient Reasoning and Coding with LLMs [60.58434523646137]
A popular approach for improving the correctness of output from large language models (LLMs) is Self-Consistency. We introduce Adaptive-Consistency, a cost-efficient, model-agnostic technique that dynamically adjusts the number of samples per question. Our experiments show that Adaptive-Consistency reduces sample budget by up to 7.9 times with an average accuracy drop of less than 0.1%.
arXiv Detail & Related papers (2023-05-19T17:49:25Z)
ReSmooth: Detecting and Utilizing OOD Samples when Training with Data Augmentation [57.38418881020046]
Recent data augmentation (DA) techniques always meet the need for diversity in augmented training samples. An augmentation strategy that has high diversity usually introduces out-of-distribution (OOD) augmented samples. We propose ReSmooth, a framework that first detects OOD samples among the augmented samples and then leverages them.
arXiv Detail & Related papers (2022-05-25T09:29:27Z)
Low-variance estimation in the Plackett-Luce model via quasi-Monte Carlo sampling [58.14878401145309]
We develop a novel approach to producing more sample-efficient estimators of expectations in the PL model. We illustrate our findings both theoretically and empirically using real-world recommendation data from Amazon Music and the Yahoo learning-to-rank challenge.
arXiv Detail & Related papers (2022-05-12T11:15:47Z)
Reparameterized Sampling for Generative Adversarial Networks [71.30132908130581]
We propose REP-GAN, a novel sampling method that allows general dependent proposals by reparameterizing the Markov chains into the latent space of the generator. Empirically, extensive experiments on synthetic and real datasets demonstrate that REP-GAN substantially improves sample efficiency and obtains better sample quality simultaneously.
arXiv Detail & Related papers (2021-07-01T10:34:55Z)
Sampling-Decomposable Generative Adversarial Recommender [84.05894139540048]
We propose a Sampling-Decomposable Generative Adversarial Recommender (SD-GAR). In the framework, the divergence between some generator and the optimum is compensated by self-normalized importance sampling. We extensively evaluate the proposed algorithm with five real-world recommendation datasets.
arXiv Detail & Related papers (2020-11-02T13:19:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.