Earley-Driven Dynamic Pruning for Efficient Structured Decoding
- URL: http://arxiv.org/abs/2506.01151v1
- Date: Sun, 01 Jun 2025 20:05:30 GMT
- Title: Earley-Driven Dynamic Pruning for Efficient Structured Decoding
- Authors: Xintong Sun, Chi Wei, Minghao Tian, Shiwen Ni
- Abstract summary: Large Language Models (LLMs) have shown remarkable capabilities, yet ensuring their outputs conform to strict structural or grammatical constraints remains challenging. We propose a novel strategy based on the Earley algorithm that identifies and eliminates invalid or redundant Earley states in real time. We implement a new constrained decoding engine called Formatron, which also incorporates existing optimizations.
- Score: 6.905994258490735
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have shown remarkable capabilities, yet ensuring their outputs conform to strict structural or grammatical constraints remains challenging, which is critical in function calls and domain-specific language (DSL) generation. Constrained decoding with context-free grammar is a flexible approach to guarantee LLMs' adherence to a specific format by dynamically building a token logits mask. However, creating this mask requires checking the validity of all tokens in the LLM vocabulary at every decoding step, which often incurs significant overhead in existing constrained decoding engines. To address this challenge, we propose $\textbf{ZapFormat}$, a novel $\textbf{dynamic pruning}$ strategy based on the Earley algorithm that identifies and eliminates invalid or redundant Earley states in real time, significantly reducing the memory occupied by the Earley algorithm's states. This further enables us to use a state cache to speed up structured generation across a large number of queries. We implemented ZapFormat in a new constrained decoding engine called Formatron, which also incorporates existing optimizations. Through comprehensive experiments on structured generation tasks, including JSON generation, JSON Schema validation, and semantic parsing, we demonstrate that Formatron not only $\textbf{consistently maintains}$ high-precision compliant outputs but also achieves $\textbf{significant improvements}$ in inference speed of up to 2x compared to state-of-the-art implementations. More importantly, Formatron is generally applicable across various LLM architectures. We release Formatron as open source at https://github.com/Dan-wanna-M/formatron.
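To make the masking mechanism concrete, here is a minimal sketch of grammar-constrained decoding with a per-step token logits mask and a cache keyed on the grammar state, the quantity that ZapFormat's pruning keeps small so cache hits become frequent. This is an illustrative simplification, not the Formatron implementation: `grammar_accepts`, the tuple-based state, and the toy vocabulary are hypothetical placeholders.

```python
# Illustrative sketch only: constrained decoding via a token logits mask,
# with a mask cache keyed on the (pruned) grammar state. Not Formatron's API.
import math
from typing import Callable, Dict, List, Tuple

State = Tuple[str, ...]  # hypothetical stand-in for a pruned set of Earley states

def build_mask(state: State,
               vocab: List[str],
               grammar_accepts: Callable[[State, str], bool],
               cache: Dict[State, List[bool]]) -> List[bool]:
    # Reuse a previously computed mask when the same grammar state recurs;
    # pruning invalid/redundant states makes such repeats far more likely.
    if state in cache:
        return cache[state]
    mask = [grammar_accepts(state, tok) for tok in vocab]
    cache[state] = mask
    return mask

def constrained_step(logits: List[float], mask: List[bool]) -> int:
    # Send invalid tokens to -inf, then pick the best remaining token greedily.
    masked = [l if ok else -math.inf for l, ok in zip(logits, mask)]
    return max(range(len(masked)), key=masked.__getitem__)

# Toy usage: a "grammar" that allows up to three digits, then a closing brace.
vocab = ["0", "1", "}", "a"]
accepts = lambda state, tok: tok.isdigit() if len(state) < 3 else tok == "}"
cache: Dict[State, List[bool]] = {}
logits = [0.1, 2.0, 1.5, 3.0]
next_id = constrained_step(logits, build_mask((), vocab, accepts, cache))
print(vocab[next_id])  # "1": the highest-scoring token that the grammar allows
```

In a real engine the expensive part is computing the mask over the full vocabulary at every step; caching it per grammar state is what turns repeated queries into cheap lookups.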
Related papers
- WGRAMMAR: Leverage Prior Knowledge to Accelerate Structured Decoding [58.1177179119881]
We introduce wgrammar, a lightweight decoding engine that integrates domain-aware simplification, constraint decomposition, and mask caching. wgrammar achieves up to 250x speedup over existing systems.
arXiv Detail & Related papers (2025-07-22T17:13:47Z) - Type-Constrained Code Generation with Language Models [51.03439021895432]
We introduce a type-constrained decoding approach that leverages type systems to guide code generation. For this purpose, we develop novel prefix automata and a search over inhabitable types, forming a sound approach to enforce well-typedness on LLM-generated code. Our approach reduces compilation errors by more than half and significantly increases functional correctness in code synthesis, translation, and repair tasks.
arXiv Detail & Related papers (2025-04-12T15:03:00Z) - Flexible and Efficient Grammar-Constrained Decoding [5.671312847528642]
Grammar-constrained decoding (GCD) can guarantee that LLM outputs match such rules. Existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm, together with an implementation, that offers 17.71x faster offline preprocessing than existing approaches.
arXiv Detail & Related papers (2025-02-07T17:35:17Z) - XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models [3.9417976759908573]
Context-free grammar is a flexible approach to enable structured generation via constrained decoding. XGrammar is a flexible and efficient structured generation engine for large language models. XGrammar can achieve up to 100x speedup over existing solutions.
arXiv Detail & Related papers (2024-11-22T18:01:37Z) - SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications [9.143856130336783]
Speculative decoding is widely adopted to reduce latency in large language model (LLM) inference. Agentic frameworks submit repetitive inference requests, such as multi-agent pipelines performing similar subtasks or self-refinement loops iteratively enhancing outputs. We introduce SuffixDecoding, a novel method that utilizes efficient suffix trees to cache long token sequences.
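As a rough illustration of this caching idea, the sketch below stores previously generated sequences and, at each step, matches the longest suffix of the current context against them to propose a speculative draft. SuffixDecoding builds efficient suffix trees; the linear scan and the `SuffixCache` class here are hypothetical simplifications meant only to show the mechanism.

```python
# Simplified sketch: reuse cached generations to propose speculative drafts.
# SuffixDecoding uses suffix trees; this version scans cached outputs directly.
from typing import List

class SuffixCache:
    def __init__(self) -> None:
        self.outputs: List[List[int]] = []

    def add(self, tokens: List[int]) -> None:
        # Remember a completed generation so later requests can reuse it.
        self.outputs.append(tokens)

    def draft(self, context: List[int], max_draft: int = 4) -> List[int]:
        # Prefer the longest suffix of `context` that occurs in any cached
        # sequence; propose the tokens that followed it as the speculative draft.
        for k in range(len(context), 0, -1):
            suffix = context[-k:]
            for seq in self.outputs:
                for i in range(len(seq) - k):
                    if seq[i:i + k] == suffix:
                        return seq[i + k:i + k + max_draft]
        return []

cache = SuffixCache()
cache.add([1, 2, 3, 4, 5, 6])     # a previously completed generation
print(cache.draft([9, 2, 3, 4]))  # proposes [5, 6] after the matching suffix [2, 3, 4]
```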
arXiv Detail & Related papers (2024-11-07T18:49:33Z) - Open-domain Implicit Format Control for Large Language Model Generation [52.83173553689678]
We introduce a novel framework for controlled generation in large language models (LLMs).
This study investigates LLMs' capabilities to follow open-domain, one-shot constraints and replicate the format of the example answers.
We also develop a dataset collection methodology for supervised fine-tuning that enhances the open-domain format control of LLMs without degrading output quality.
arXiv Detail & Related papers (2024-08-08T11:51:45Z) - Automata-based constraints for language model decoding [9.137697105669142]
Language models (LMs) are often expected to generate strings in some formal language.
Fine-tuning requires significant resources, making it impractical for uncommon or task-specific formats.
We solve these issues through the application of automata theory.
Our system compiles constraints 7,000x faster, is provably correct, and can be extended in a modular fashion.
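For a rough sense of the approach, the toy example below hand-writes a small DFA for a quoted-word constraint and precomputes, per automaton state, which vocabulary tokens are allowed, so each decoding step reduces to a table lookup. The three-state machine and tiny vocabulary are hypothetical; the actual system compiles real formal-language constraints into automata.

```python
# Toy sketch: compile a constraint into a DFA once, then mask tokens per state.
# The DFA below accepts a quoted word over {a, b, c}, e.g. "ab".
from typing import Dict, Set

transitions: Dict[int, Dict[str, int]] = {
    0: {'"': 1},                           # expect an opening quote
    1: {**{c: 1 for c in "abc"}, '"': 2},  # letters, or a closing quote
    2: {},                                 # accepting state: nothing more allowed
}

# "Compile time": precompute the legal vocabulary tokens for every DFA state.
vocab = ['"', 'a', 'b', 'c', 'x']
allowed: Dict[int, Set[str]] = {
    state: {tok for tok in vocab if tok in edges}
    for state, edges in transitions.items()
}

# "Decode time": each step is a cheap membership test plus a transition.
state = 0
for tok in ['"', 'a', 'b', '"']:
    assert tok in allowed[state], f"token {tok!r} violates the constraint"
    state = transitions[state][tok]
print("accepted, final state:", state)
```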
arXiv Detail & Related papers (2024-07-11T00:25:01Z) - Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z) - Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
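A purely schematic sketch of that divide-and-conquer recipe is shown below: the input is split into fixed-size chunks, a token-reduction step runs before every merge, and chunks are merged pairwise until one remains. It operates on plain token-id lists for readability, whereas HOMER merges hidden representations inside the model; the `keep_ratio` stride heuristic is a hypothetical stand-in for its importance-based reduction.

```python
# Schematic only: hierarchical chunk merging with token reduction before each merge.
from typing import List

def reduce_tokens(chunk: List[int], keep_ratio: float = 0.5) -> List[int]:
    # Hypothetical reduction: keep an evenly strided subset of the chunk.
    # (A real system would score tokens, e.g. by attention, and keep the top ones.)
    keep = max(1, int(len(chunk) * keep_ratio))
    stride = max(1, len(chunk) // keep)
    return chunk[::stride][:keep]

def hierarchical_merge(tokens: List[int], chunk_size: int = 8) -> List[int]:
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    while len(chunks) > 1:
        merged = []
        for i in range(0, len(chunks), 2):
            pair = [reduce_tokens(c) for c in chunks[i:i + 2]]  # reduce, then merge
            merged.append([tok for c in pair for tok in c])
        chunks = merged
    return chunks[0]

# The merged result stays bounded in size no matter how long the input is.
print(hierarchical_merge(list(range(64))))
```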
arXiv Detail & Related papers (2024-04-16T06:34:08Z) - Guiding LLMs The Right Way: Fast, Non-Invasive Constrained Generation [7.687678490751105]
We present a novel decoding algorithm, DOMINO, that enforces constraints in a fully subword-aligned fashion while leveraging pre-computation and speculative decoding to achieve virtually no overhead and, in some cases, almost a 2x speedup over unconstrained decoding, thereby outperforming existing approaches by a wide margin.
arXiv Detail & Related papers (2024-02-07T13:36:02Z) - LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models [83.98062659664785]
Large language models (LLMs) typically train on short text segments (e.g., 4K tokens) due to the quadratic complexity of their Transformer architectures.
This work identifies three major factors contributing to this length generalization failure.
We propose LM-Infinite, a simple and effective method for enhancing LLMs' capabilities of handling long contexts.
arXiv Detail & Related papers (2023-08-30T16:47:51Z) - AMOM: Adaptive Masking over Masking for Conditional Masked Language Model [81.55294354206923]
A conditional masked language model (CMLM) is one of the most versatile frameworks for non-autoregressive sequence generation.
We introduce a simple yet effective adaptive masking over masking strategy to enhance the refinement capability of the decoder.
Our proposed model yields state-of-the-art performance on neural machine translation.
arXiv Detail & Related papers (2023-03-13T20:34:56Z)