Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding
- URL: http://arxiv.org/abs/2504.20456v1
- Date: Tue, 29 Apr 2025 06:33:13 GMT
- Title: Reviving Any-Subset Autoregressive Models with Principled Parallel Sampling and Speculative Decoding
- Authors: Gabe Guo, Stefano Ermon,
- Abstract summary: In arbitrary-order language models, it is an open question how to sample tokens in parallel from the correct joint distribution.<n>We find that a different class of models, any-subset autoregressive models (AS-ARMs), holds the solution.<n>We show that AS-ARMs achieve state-of-the-art performance among sub-200M parameter models on infilling benchmark tasks, and nearly match the performance of models 50X larger on code generation.
- Score: 55.2480439325792
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In arbitrary-order language models, it is an open question how to sample tokens in parallel from the correct joint distribution. With discrete diffusion models, the more tokens they generate in parallel, the less their predicted distributions adhere to the originally learned data distribution, as they rely on a conditional independence assumption that only works with infinitesimally small timesteps. We find that a different class of models, any-subset autoregressive models (AS-ARMs), holds the solution. As implied by the name, AS-ARMs can generate tokens in any order, and in parallel. Moreover, AS-ARMs support parallelized joint probability density estimation, allowing them to correct their own parallel-generated token distributions, via our Any-Subset Speculative Decoding (ASSD) algorithm. ASSD provably enables generation of tokens from the correct joint distribution, with the number of neural network calls upper bounded by the number of tokens predicted. We empirically verify that ASSD speeds up language generation, without sacrificing quality. Furthermore, we provide a mathematically justified scheme for training AS-ARMs for generation, and show that AS-ARMs achieve state-of-the-art performance among sub-200M parameter models on infilling benchmark tasks, and nearly match the performance of models 50X larger on code generation. Our theoretical and empirical results indicate that the once-forgotten AS-ARMs are a promising direction of language modeling.
Related papers
- Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection [49.15148871877941]
Next-token distribution outputs offer a theoretically appealing approach for detection of large language models (LLMs)<n>We propose the Perplexity Attention Weighted Network (PAWN), which uses the last hidden states of the LLM and positions to weight the sum of a series of features based on metrics from the next-token distribution across the sequence length.<n>PAWN shows competitive and even better performance in-distribution than the strongest baselines with a fraction of their trainable parameters.
arXiv Detail & Related papers (2025-01-07T17:00:49Z) - Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LMs)<n>This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts.<n>We introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution.
arXiv Detail & Related papers (2024-10-11T23:30:42Z) - Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs)
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z) - Non-autoregressive Sequence-to-Sequence Vision-Language Models [59.445765313094434]
We propose a parallel decoding sequence-to-sequence vision-language model that marginalizes over multiple inference paths in the decoder.<n>The model achieves performance on-par with its state-of-the-art autoregressive counterpart, but is faster at inference time.
arXiv Detail & Related papers (2024-03-04T17:34:59Z) - SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking [60.109453252858806]
A maximum-likelihood (MLE) objective does not match a downstream use-case of autoregressively generating high-quality sequences.
We formulate sequence generation as an imitation learning (IL) problem.
This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and sequences from a dataset.
Our resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes.
arXiv Detail & Related papers (2023-06-08T17:59:58Z) - Autoregressive Diffusion Models [34.125045462636386]
We introduce Autoregressive Diffusion Models (ARDMs), a model class encompassing and generalizing order-agnostic autoregressive models.
ARDMs are simple to implement and easy to train, and can be trained using an efficient objective similar to modern probabilistic diffusion models.
We show that ARDMs obtain compelling results not only on complete datasets, but also on compressing single data points.
arXiv Detail & Related papers (2021-10-05T13:36:55Z) - AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of what utterances or tokens are dull without any feature-engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
arXiv Detail & Related papers (2020-01-15T18:32:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.