Related papers: Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling

Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling

URL: http://arxiv.org/abs/2504.05410v1
Date: Mon, 07 Apr 2025 18:30:18 GMT
Title: Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling
Authors: Benjamin Lipkin, Benjamin LeBrun, Jacob Hoover Vigly, João Loula, David R. MacIver, Li Du, Jason Eisner, Ryan Cotterell, Vikash Mansinghka, Timothy J. O'Donnell, Alexander K. Lew, Tim Vieira,
Abstract summary: evaluating constraint on every token can be prohibitively expensive.<n> LCD can distort the global distribution over strings, sampling tokens based only on local information.<n>We show that our approach is superior to state-of-the-art baselines.
Score: 90.86991492288487
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The dominant approach to generating from language models subject to some constraint is locally constrained decoding (LCD), incrementally sampling tokens at each time step such that the constraint is never violated. Typically, this is achieved through token masking: looping over the vocabulary and excluding non-conforming tokens. There are two important problems with this approach. (i) Evaluating the constraint on every token can be prohibitively expensive -- LM vocabularies often exceed $100,000$ tokens. (ii) LCD can distort the global distribution over strings, sampling tokens based only on local information, even if they lead down dead-end paths. This work introduces a new algorithm that addresses both these problems. First, to avoid evaluating a constraint on the full vocabulary at each step of generation, we propose an adaptive rejection sampling algorithm that typically requires orders of magnitude fewer constraint evaluations. Second, we show how this algorithm can be extended to produce low-variance, unbiased estimates of importance weights at a very small additional cost -- estimates that can be soundly used within previously proposed sequential Monte Carlo algorithms to correct for the myopic behavior of local constraint enforcement. Through extensive empirical evaluation in text-to-SQL, molecular synthesis, goal inference, pattern matching, and JSON domains, we show that our approach is superior to state-of-the-art baselines, supporting a broader class of constraints and improving both runtime and performance. Additional theoretical and empirical analyses show that our method's runtime efficiency is driven by its dynamic use of computation, scaling with the divergence between the unconstrained and constrained LM, and as a consequence, runtime improvements are greater for better models.

Related papers

Controlled LLM Decoding via Discrete Auto-regressive Biasing [9.843359827321194]
Controlled text generation allows for enforcing user-defined constraints on large language model outputs.<n>We propose Discrete Auto-regressive Biasing, a controlled decoding algorithm that leverages gradients while operating entirely in the discrete text domain.<n>Our method significantly improves constraint satisfaction while maintaining comparable or better fluency, all with even lower computational costs.
arXiv Detail & Related papers (2025-02-06T00:14:43Z)
Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection [49.15148871877941]
Next-token distribution outputs offer a theoretically appealing approach for detection of large language models (LLMs)<n>We propose the Perplexity Attention Weighted Network (PAWN), which uses the last hidden states of the LLM and positions to weight the sum of a series of features based on metrics from the next-token distribution across the sequence length.<n>PAWN shows competitive and even better performance in-distribution than the strongest baselines with a fraction of their trainable parameters.
arXiv Detail & Related papers (2025-01-07T17:00:49Z)
Retrofitting Large Language Models with Dynamic Tokenization [3.608780819053423]
We propose retrofitting current language models with dynamic tokenization.<n>We merge frequent subword sequences in a batch, then apply a pre-trained embedding-prediction hypernetwork to compute the token embeddings on-the-fly.<n>We find that dynamic tokenization can mitigate the limitations of static tokenization by substantially improving inference speed and promoting fairness across languages.
arXiv Detail & Related papers (2024-11-27T17:51:58Z)
Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LMs) This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts. We introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution.
arXiv Detail & Related papers (2024-10-11T23:30:42Z)
Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions. We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
Large Language Models as General Pattern Machines [64.75501424160748]
We show that pre-trained large language models (LLMs) are capable of autoregressively completing complex token sequences. Surprisingly, pattern completion proficiency can be partially retained even when the sequences are expressed using tokens randomly sampled from the vocabulary. In this work, we investigate how these zero-shot capabilities may be applied to problems in robotics.
arXiv Detail & Related papers (2023-07-10T17:32:13Z)
Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep. We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR) Specifically, we propose to inject the standard Gaussian noise and regularize hidden representations of the fine-tuned model. We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
arXiv Detail & Related papers (2022-06-12T04:42:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.