WGRAMMAR: Leverage Prior Knowledge to Accelerate Structured Decoding
- URL: http://arxiv.org/abs/2507.16768v1
- Date: Tue, 22 Jul 2025 17:13:47 GMT
- Title: WGRAMMAR: Leverage Prior Knowledge to Accelerate Structured Decoding
- Authors: Ran Wang, Xiaoxuan Liu, Hao Ren, Gang Chen, Fanchao Qi, Maosong Sun
- Abstract summary: We introduce wgrammar, a lightweight decoding engine that integrates domain-aware simplification, constraint decomposition, and mask caching. wgrammar achieves up to 250x speedup over existing systems.
- Score: 58.1177179119881
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Structured decoding enables large language models (LLMs) to generate outputs in formats required by downstream systems, such as HTML or JSON. However, existing methods suffer from efficiency bottlenecks due to grammar compilation, state tracking, and mask creation. We observe that many real-world tasks embed strong prior knowledge about output structure. Leveraging this, we propose a decomposition of constraints into static and dynamic components -- precompiling static structures offline and instantiating dynamic arguments at runtime using grammar snippets. Instead of relying on pushdown automata, we employ a compositional set of operators to model regular formats, achieving lower transition latency. We introduce wgrammar, a lightweight decoding engine that integrates domain-aware simplification, constraint decomposition, and mask caching, achieving up to 250x speedup over existing systems. wgrammar's source code is publicly available at https://github.com/wrran/wgrammar.
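The decomposition described in the abstract can be made concrete with a small sketch: a static output skeleton is prepared once offline, dynamic field patterns are spliced in per request as grammar snippets, and the allowed-token mask for each prefix is memoized. Everything below is illustrative only, under assumed names; it is not wgrammar's API, and it leans on the third-party `regex` package for partial (prefix) matching.

```python
# Minimal sketch of static/dynamic constraint decomposition plus mask caching.
# Names are hypothetical, not wgrammar's API; partial matching uses the
# third-party `regex` package (pip install regex), not the stdlib `re`.
from functools import lru_cache
from typing import Tuple

import regex  # supports partial (prefix) matching


class StructuredConstraint:
    """Static component: the output skeleton, known offline."""

    def __init__(self, static_skeleton: str):
        self.static_skeleton = static_skeleton

    def instantiate(self, **dynamic_snippets: str) -> "CompiledConstraint":
        # Dynamic component: small regular "grammar snippets" supplied at
        # request time are spliced into the precompiled skeleton.
        return CompiledConstraint(self.static_skeleton.format(**dynamic_snippets))


class CompiledConstraint:
    def __init__(self, pattern: str):
        self._re = regex.compile(pattern)

    def allows_prefix(self, text: str) -> bool:
        # True if `text` can still be extended into a full match.
        return self._re.fullmatch(text, partial=True) is not None


@lru_cache(maxsize=None)
def token_mask(constraint: CompiledConstraint, prefix: str,
               vocab: Tuple[str, ...]) -> Tuple[bool, ...]:
    # Mask caching: the allowed-token mask depends only on (constraint, prefix),
    # so it can be memoized across decoding steps and repeated requests.
    return tuple(constraint.allows_prefix(prefix + tok) for tok in vocab)


# Example: a fixed HTML-like scaffold with two runtime-supplied field patterns.
skeleton = StructuredConstraint(r"<title>{title}</title><tags>{tags}</tags>")
compiled = skeleton.instantiate(title=r"[A-Za-z ]+",
                                tags=r"(news|sports)(,(news|sports))*")
print(token_mask(compiled, "<title>Hel", ("lo", "</title>", "123")))
# -> (True, True, False): digits cannot appear inside <title>.
```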
Related papers
- Earley-Driven Dynamic Pruning for Efficient Structured Decoding [6.905994258490735]
Large Language Models (LLMs) have shown remarkable capabilities, yet ensuring their outputs conform to strict structural or grammatical constraints remains challenging. We propose a novel strategy based on the Earley algorithm that identifies and eliminates invalid or redundant Earley states in real time. We implement a new constrained decoding engine called Formatron, which also incorporates existing optimizations.
arXiv Detail & Related papers (2025-06-01T20:05:30Z)
- Guided Tensor Lifting [54.10411390218929]
Domain-specific languages (DSLs) for machine learning are revolutionizing the speed and efficiency of machine learning workloads. To take advantage of these capabilities, a user must first translate their legacy code from the language it is currently written in, into the new DSL. The task of automatically lifting code into these DSLs has been taken up by several recent works, which propose program synthesis as a solution.
arXiv Detail & Related papers (2025-04-28T12:00:10Z)
- Type-Constrained Code Generation with Language Models [51.03439021895432]
We introduce a type-constrained decoding approach that leverages type systems to guide code generation. For this purpose, we develop novel prefix automata and a search over inhabitable types, forming a sound approach to enforce well-typedness on LLM-generated code. Our approach reduces compilation errors by more than half and significantly increases functional correctness in code synthesis, translation, and repair tasks.
arXiv Detail & Related papers (2025-04-12T15:03:00Z)
- Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models [3.382910438968506]
Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process. We investigate a hierarchical architecture for autoregressive language modelling that combines character-level and word-level processing. We demonstrate, at scales up to 7 billion parameters, that hierarchical transformers match the downstream task performance of subword-tokenizer-based models.
arXiv Detail & Related papers (2025-01-17T17:51:53Z)
- XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models [3.9417976759908573]
Context-free grammar is a flexible approach to enable structured generation via constrained decoding. XGrammar is a flexible and efficient structured generation engine for large language models. XGrammar can achieve up to 100x speedup over existing solutions.
arXiv Detail & Related papers (2024-11-22T18:01:37Z)
- Automata-based constraints for language model decoding [9.137697105669142]
Language models (LMs) are often expected to generate strings in some formal language.
Tuning an LM to follow such formats requires significant resources, making it impractical for uncommon or task-specific formats.
We solve these issues through the application of automata theory.
Our system compiles constraints 7,000x faster, is provably correct, and can be extended in a modular fashion.
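As a rough illustration of the automaton-based idea in this entry (not the paper's actual system, which compiles its automata from the format specification and precomputes token-level transitions), the sketch below hand-builds a tiny DFA for a made-up `key=value;` format and uses it to mask candidate tokens during decoding.

```python
# Toy DFA-based token masking in the spirit of automaton-constrained decoding.
# The DFA, the format, and all names are invented for illustration only.
from typing import Dict, List, Optional, Set


class DFA:
    def __init__(self, transitions: Dict[int, Dict[str, int]],
                 start: int, accepting: Set[int]):
        self.transitions = transitions
        self.start = start
        self.accepting = accepting

    def step(self, state: Optional[int], char: str) -> Optional[int]:
        if state is None:
            return None  # dead state stays dead
        return self.transitions.get(state, {}).get(char)

    def run(self, state: Optional[int], text: str) -> Optional[int]:
        for ch in text:
            state = self.step(state, ch)
        return state


# Toy format: one or more lowercase letters, '=', one or more digits, ';'
LETTERS = {c: 1 for c in "abcdefghijklmnopqrstuvwxyz"}
DIGITS = {c: 3 for c in "0123456789"}
dfa = DFA(
    transitions={
        0: dict(LETTERS),        # expecting the first letter of the key
        1: {**LETTERS, "=": 2},  # inside the key
        2: dict(DIGITS),         # expecting the first digit of the value
        3: {**DIGITS, ";": 4},   # inside the value
    },
    start=0,
    accepting={4},
)


def allowed_tokens(dfa: DFA, state: Optional[int], vocab: List[str]) -> List[str]:
    # A token is allowed if consuming its characters keeps the DFA alive;
    # it does not have to reach an accepting state yet.
    return [tok for tok in vocab if dfa.run(state, tok) is not None]


state = dfa.run(dfa.start, "temp")  # text decoded so far: "temp"
print(allowed_tokens(dfa, state, ["=42;", "erature", "=abc", " 42"]))
# -> ['=42;', 'erature']
```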
arXiv Detail & Related papers (2024-07-11T00:25:01Z)
- Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z)
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
CodeGRAG builds a graphical view of code blocks based on their control flow and data flow to better interpret programming domain knowledge. CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gains for cross-lingual code generation.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning [27.59524153097858]
Grammar-constrained decoding (GCD) can be used to control the generation of large language models (LMs).
GCD can serve as a unified framework for structured NLP tasks in general.
We show that grammar-constrained LMs substantially outperform unconstrained LMs or even beat task-specific finetuned models.
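To make the idea concrete, here is a deliberately tiny grammar-constrained decoding example in which the "grammar" is just a closed set of NLI label strings; the label set and function names are invented for illustration and are far simpler than the context-free grammars the paper's framework supports.

```python
# Grammar-constrained decoding in miniature: at each step, only tokens that
# keep the partial output a prefix of some allowed label may be generated,
# so an unconstrained LM emits a valid label without finetuning.
from typing import List

LABELS = ["entailment", "contradiction", "neutral"]


def valid_next_tokens(prefix: str, vocab: List[str],
                      labels: List[str] = LABELS) -> List[str]:
    # Allow a token only if some label still starts with the extended prefix.
    return [tok for tok in vocab
            if any(label.startswith(prefix + tok) for label in labels)]


print(valid_next_tokens("ne", ["utral", "gative", "u", "ver"]))
# -> ['utral', 'u']
```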
arXiv Detail & Related papers (2023-05-23T11:54:37Z)
- Tree-structured Attention with Hierarchical Accumulation [103.47584968330325]
"Hierarchical Accumulation" encodes parse tree structures into self-attention at constant time complexity.
Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German translation task.
arXiv Detail & Related papers (2020-02-19T08:17:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.