Local Normalization Distortion and the Thermodynamic Formalism of Decoding Strategies for Large Language Models
- URL: http://arxiv.org/abs/2503.21929v1
- Date: Thu, 27 Mar 2025 19:15:43 GMT
- Title: Local Normalization Distortion and the Thermodynamic Formalism of Decoding Strategies for Large Language Models
- Authors: Tom Kempton, Stuart Burrell
- Abstract summary: We develop the theory of decoding strategies for language models by expressing popular decoding algorithms as equilibrium states in the language of ergodic theory. We analyze the effect of the local normalization step of top-k, nucleus, and temperature sampling, used to make probabilities sum to one. Contrary to the prevailing explanation, we argue that the major cause of the under-performance of top-k sampling relative to nucleus sampling is local normalization distortion.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Advances in hardware and language model architecture have spurred a revolution in natural language generation. However, autoregressive models compute probability distributions over next-token choices, and sampling from these distributions, known as decoding, has received significantly less attention than other design choices. Existing decoding strategies are largely based on heuristics, resulting in methods that are hard to apply or improve in a principled manner. We develop the theory of decoding strategies for language models by expressing popular decoding algorithms as equilibrium states in the language of ergodic theory and stating the functions they optimize. Using this, we analyze the effect of the local normalization step of top-k, nucleus, and temperature sampling, used to make probabilities sum to one. We argue that local normalization distortion is a fundamental defect of decoding strategies and quantify the size of this distortion and its effect on mathematical proxies for the quality and diversity of generated text. Contrary to the prevailing explanation, we argue that the major cause of the under-performance of top-k sampling relative to nucleus sampling is local normalization distortion. This yields conclusions for the future design of decoding algorithms and the detection of machine-generated text.
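To make the local normalization step concrete, the following Python sketch (illustrative only; the function name, defaults, and structure are not taken from the paper) shows where top-k, nucleus, and temperature sampling renormalize a truncated or rescaled next-token distribution so that it sums to one. The per-step renormalization factor depends on the prefix, which is the source of the distortion the abstract refers to.

```python
import numpy as np

def locally_normalized_probs(logits, strategy="top_k", k=50, p=0.9, temperature=1.0):
    """Sketch of top-k, nucleus, and temperature sampling.

    Each strategy truncates or rescales the model's next-token distribution
    and then renormalizes it so it sums to one, i.e. the "local normalization"
    step discussed in the abstract. All names and defaults are illustrative.
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # model's next-token distribution

    if strategy == "temperature":
        scaled = np.exp((logits - logits.max()) / temperature)
        return scaled / scaled.sum()           # renormalize after rescaling

    order = np.argsort(probs)[::-1]            # tokens sorted by probability
    if strategy == "top_k":
        keep = order[:k]                       # k most likely tokens
    elif strategy == "nucleus":
        cumulative = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cumulative, p)) + 1
        keep = order[:cutoff]                  # smallest set with mass >= p
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    truncated = np.zeros_like(probs)
    truncated[keep] = probs[keep]
    return truncated / truncated.sum()         # local normalization: sum to one

# The renormalization factor truncated.sum() differs from prefix to prefix,
# so the induced distribution over whole sequences deviates from the model's
# own conditional probabilities in a prefix-dependent way.
```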
Related papers
- Efficient and Asymptotically Unbiased Constrained Decoding for Large Language Models [46.43567715840425]
This paper introduces Dynamic Importance Sampling for Constrained Parallel Prefix-Verification (PPV).
PPV is a novel algorithm that leverages dynamic importance sampling to achieve theoretically guaranteed unbiasedness and to overcome the inefficiency of prefix-tree-based methods.
Experiments demonstrate the superiority of our method over existing methods in both efficiency and output quality.
arXiv Detail & Related papers (2025-04-12T08:49:21Z)
- A Theoretical Perspective for Speculative Decoding Algorithm [60.79447486066416]
One effective way to accelerate inference is Speculative Decoding, which employs a small model to sample a sequence of draft tokens and a large model to validate them.
This paper tackles this gap by conceptualizing the decoding problem via a Markov chain abstraction and studying its key properties, output quality and inference acceleration, from a theoretical perspective.
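As a concrete illustration of the draft-then-verify loop described above, here is a minimal sketch of one speculative decoding round using the standard min(1, p/q) accept/reject rule; `draft_model` and `target_model` are toy stand-ins (assumptions), not real language models and not code from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50

def draft_model(prefix):   # toy stand-in for the small draft model (assumption)
    logits = rng.standard_normal(VOCAB)
    q = np.exp(logits - logits.max())
    return q / q.sum()

def target_model(prefix):  # toy stand-in for the large target model (assumption)
    logits = rng.standard_normal(VOCAB)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def speculative_round(prefix, num_draft=4):
    """One draft-and-verify round: the small model proposes num_draft tokens;
    the large model accepts each with probability min(1, p(t)/q(t)) and, on the
    first rejection, resamples from the residual distribution max(p - q, 0)."""
    drafts, draft_dists = [], []
    for _ in range(num_draft):
        q = draft_model(prefix + drafts)
        token = int(rng.choice(VOCAB, p=q))
        drafts.append(token)
        draft_dists.append(q)

    accepted = []
    for q, token in zip(draft_dists, drafts):
        p = target_model(prefix + accepted)
        if rng.random() < min(1.0, p[token] / q[token]):
            accepted.append(token)             # token verified by the large model
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(VOCAB, p=residual)))
            break                              # stop at the first rejection
    return accepted

print(speculative_round(prefix=[1, 2, 3]))
```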
arXiv Detail & Related papers (2024-10-30T01:53:04Z)
- Local and Global Decoding in Text Generation [36.38298679687864]
Text generation relies on decoding algorithms that sample strings from a language model distribution.
We investigate the distortion these local decoding methods introduce by comparing them with globally-normalised versions of the same methods.
Our results suggest that this distortion is an important feature of local decoding algorithms.
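A toy illustration of the contrast studied there: locally-normalized top-k renormalizes after every step, while a globally-normalized variant restricts the model to admissible sequences and renormalizes once over whole strings. The conditional table and helper names below are illustrative assumptions, not code from the paper.

```python
import itertools
import numpy as np

# Toy autoregressive model over a 3-token vocabulary and length-2 strings.
V = 3
def next_probs(prefix):
    table = {(): np.array([0.6, 0.3, 0.1]),
             (0,): np.array([0.5, 0.4, 0.1]),
             (1,): np.array([0.2, 0.7, 0.1]),
             (2,): np.array([0.4, 0.4, 0.2])}
    return table[tuple(prefix)]

def top_k_local(prefix, k=2):
    """Locally-normalized top-k: truncate, then renormalize at every step."""
    p = next_probs(prefix)
    keep = np.argsort(p)[::-1][:k]
    q = np.zeros(V)
    q[keep] = p[keep]
    return q / q.sum()

def sequence_prob_local(seq, k=2):
    prob = 1.0
    for i, t in enumerate(seq):
        prob *= top_k_local(seq[:i], k)[t]
    return prob

def sequence_prob_global(seq, k=2):
    """Globally-normalized top-k: restrict the model to sequences whose every
    token is in that step's top-k, and renormalize once over whole sequences."""
    def admissible(s):
        return all(t in np.argsort(next_probs(s[:i]))[::-1][:k] for i, t in enumerate(s))
    def model_prob(s):
        return np.prod([next_probs(s[:i])[t] for i, t in enumerate(s)])
    Z = sum(model_prob(s) for s in itertools.product(range(V), repeat=len(seq)) if admissible(s))
    return model_prob(seq) / Z if admissible(seq) else 0.0

# The two distributions generally disagree: the ratio of local to global
# probability varies with how much mass each prefix loses to truncation.
for seq in itertools.product(range(V), repeat=2):
    print(seq, round(sequence_prob_local(list(seq)), 4), round(sequence_prob_global(list(seq)), 4))
```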
arXiv Detail & Related papers (2024-10-14T17:59:38Z)
- Language Rectified Flow: Advancing Diffusion Language Generation with Probabilistic Flows [53.31856123113228]
This paper proposes Language Rectified Flow.
Our method is based on a reformulation of standard probabilistic flow models.
Experiments and ablation studies demonstrate that our method can be general, effective, and beneficial for many NLP tasks.
arXiv Detail & Related papers (2024-03-25T17:58:22Z)
- Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
- Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System [73.52878118434147]
We present methods to reverse-engineer the decoding method used to generate text.
Our ability to discover which decoding strategy was used has implications for detecting generated text.
arXiv Detail & Related papers (2023-09-09T18:19:47Z)
- An Optimal, Universal and Agnostic Decoding Method for Message Reconstruction, Bio and Technosignature Detection [0.14061979259370275]
We present an agnostic signal reconstruction method for zero-knowledge one-way communication channels. We investigate how non-random messages encode information about their intended physical properties.
arXiv Detail & Related papers (2023-03-28T15:20:25Z)
- A Simple, Yet Effective Approach to Finding Biases in Code Generation [16.094062131137722]
This work shows that current code generation systems exhibit undesired biases inherited from their large language model backbones.
We propose the "block of influence" concept, which enables a modular decomposition and analysis of the coding challenges.
arXiv Detail & Related papers (2022-10-31T15:06:15Z)
- Deep Equilibrium Assisted Block Sparse Coding of Inter-dependent Signals: Application to Hyperspectral Imaging [71.57324258813675]
A dataset of inter-dependent signals is defined as a matrix whose columns demonstrate strong dependencies.
A neural network is employed to act as a structure prior and reveal the underlying signal interdependencies.
Deep unrolling and deep equilibrium based algorithms are developed, forming highly interpretable and concise deep-learning-based architectures.
arXiv Detail & Related papers (2022-03-29T21:00:39Z)
- A Contrastive Framework for Neural Text Generation [46.845997620234265]
We show that an underlying reason for model degeneration is the anisotropic distribution of token representations.
We present a contrastive solution: (i) SimCTG, a contrastive training objective to calibrate the model's representation space, and (ii) a decoding method -- contrastive search -- to encourage diversity while maintaining coherence in the generated text.
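For reference, a minimal sketch of the contrastive search selection rule: among the top-k candidates, trade off model confidence against a degeneration penalty given by the maximum cosine similarity with the hidden states of previously generated tokens. The function signature, and the assumption that candidate hidden states are available from a forward pass, are illustrative; this is not the authors' released implementation.

```python
import numpy as np

def contrastive_search_step(probs, candidate_hidden, context_hidden, k=5, alpha=0.6):
    """One step of contrastive search (sketch).

    probs:            next-token probabilities from the model, shape (V,)
    candidate_hidden: hidden state the model would produce for each candidate
                      token, shape (V, d); assumed available from a forward pass
    context_hidden:   hidden states of tokens generated so far, shape (T, d), T >= 1

    Picks the top-k candidate maximizing
        (1 - alpha) * p(v | context) - alpha * max_j cos_sim(h_v, h_j),
    i.e. model confidence minus a degeneration penalty.
    """
    top_k = np.argsort(probs)[::-1][:k]

    def cos_sim(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    scores = []
    for v in top_k:
        penalty = max(cos_sim(candidate_hidden[v], h) for h in context_hidden)
        scores.append((1 - alpha) * probs[v] - alpha * penalty)
    return int(top_k[int(np.argmax(scores))])
```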
arXiv Detail & Related papers (2022-02-13T21:46:14Z)
- Contextualized Perturbation for Textual Adversarial Attack [56.370304308573274]
Adversarial examples expose the vulnerabilities of natural language processing (NLP) models.
This paper presents CLARE, a ContextuaLized AdversaRial Example generation model that produces fluent and grammatical outputs.
arXiv Detail & Related papers (2020-09-16T06:53:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.