Flexible and Efficient Grammar-Constrained Decoding
- URL: http://arxiv.org/abs/2502.05111v1
- Date: Fri, 07 Feb 2025 17:35:17 GMT
- Title: Flexible and Efficient Grammar-Constrained Decoding
- Authors: Kanghee Park, Timothy Zhou, Loris D'Antoni
- Abstract summary: Grammar-constrained decoding (GCD) can guarantee that LLM outputs match precise syntactic rules.
Existing GCD algorithms require tens of minutes to preprocess common grammars.
We present a new GCD algorithm together with an implementation that offers 17.71x faster offline preprocessing than existing approaches.
- Score: 5.671312847528642
- Abstract: Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs match such rules by masking out tokens that will provably lead to outputs that do not belong to a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms have to compute how a given LLM subword tokenizer can align with the tokens used by a given context-free grammar and compute token masks based on this information. Doing so efficiently is challenging, and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm together with an implementation that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.
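For readers unfamiliar with the mechanism, here is a minimal sketch of the per-step token masking behind GCD. Everything in it is illustrative: the toy vocabulary, the `is_valid_prefix` check (balanced parentheses standing in for a real CFG), and the random `fake_logits` stand-in for the model. The paper's actual contribution, fast offline alignment of subword tokens with grammar terminals, is not reproduced here.

```python
import math
import random

# Toy subword vocabulary; "<eos>" ends generation.
VOCAB = ["(", ")", "()", "((", "))", "<eos>"]

def is_valid_prefix(text: str) -> bool:
    """Prefix check for a toy language of balanced parentheses: a prefix is
    extendable to a valid string iff it never closes more than it opened."""
    depth = 0
    for ch in text:
        depth += 1 if ch == "(" else -1
        if depth < 0:
            return False
    return True

def fake_logits(_context: str) -> list:
    """Stand-in for an LLM forward pass: one random score per vocab entry."""
    return [random.gauss(0.0, 1.0) for _ in VOCAB]

def gcd_decode(max_steps: int = 8) -> str:
    """Greedy grammar-constrained decoding: mask every token whose addition
    provably leaves the language, then pick the best surviving token."""
    out = ""
    for _ in range(max_steps):
        scores = fake_logits(out)
        masked = []
        for tok, score in zip(VOCAB, scores):
            if tok == "<eos>":
                # Ending is allowed only once the output is a complete,
                # balanced (and non-empty) string.
                ok = bool(out) and out.count("(") == out.count(")")
            else:
                ok = is_valid_prefix(out + tok)
            masked.append(score if ok else -math.inf)
        best = max(range(len(VOCAB)), key=masked.__getitem__)
        if VOCAB[best] == "<eos>":
            break
        out += VOCAB[best]
    return out

print(gcd_decode())  # always a valid prefix of a balanced string, e.g. "(()("
```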
Related papers
- Enhancing Item Tokenization for Generative Recommendation through Self-Improvement [67.94240423434944]
Generative recommendation systems are driven by large language models (LLMs).
Current item tokenization methods include using text descriptions, numerical strings, or sequences of discrete tokens.
We propose a self-improving item tokenization method that allows the LLM to refine its own item tokenizations during the training process.
arXiv Detail & Related papers (2024-12-22T21:56:15Z) - XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models [3.9417976759908573]
Context-free grammar is a flexible approach to enable structured generation via constrained decoding.
XGrammar is a flexible and efficient structured generation engine for large language models.
XGrammar can achieve up to 100x speedup over existing solutions.
arXiv Detail & Related papers (2024-11-22T18:01:37Z) - Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling [53.58854856174773]
Speculative decoding is an approach to accelerate inference through a guess-and-verify paradigm.
Token Recycling stores candidate tokens in an adjacency matrix and employs a breadth-first search algorithm.
It significantly outperforms existing training-free methods by 30% and even a training-based method by 25%.
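The adjacency-matrix idea can be illustrated roughly as below; the successor table, token IDs, and the `verify` stub are all made up, and a real system would refresh the table with candidates recycled from each decoding step and verify drafts with a single LLM forward pass.

```python
from collections import deque

TOP_K = 3  # candidate successors remembered per token

# "Adjacency matrix": for each token id, the top-k successor ids recycled
# from earlier decoding steps (values here are made up).
ADJ = {
    0: [1, 2, 3],
    1: [4, 2, 0],
    2: [3, 1, 4],
    3: [0, 4, 1],
    4: [2, 0, 3],
}

def build_draft_paths(root: int, depth: int) -> list:
    """Breadth-first expansion from the last accepted token, producing
    candidate continuations for the guess-and-verify step."""
    paths, queue = [], deque([[root]])
    while queue:
        path = queue.popleft()
        if len(path) - 1 == depth:
            paths.append(path[1:])      # drop the root, keep drafted tokens
            continue
        for nxt in ADJ.get(path[-1], [])[:TOP_K]:
            queue.append(path + [nxt])
    return paths

def verify(drafts: list) -> list:
    """Stub for verification: a real implementation scores all drafts in one
    forward pass and keeps the longest prefix the model agrees with."""
    return drafts[0] if drafts else []

drafts = build_draft_paths(root=0, depth=2)
print(len(drafts), "draft paths; accepted:", verify(drafts))
```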
arXiv Detail & Related papers (2024-08-16T12:20:56Z) - Grammar-Aligned Decoding [30.972850034752884]
Large Language Models (LLMs) struggle with reliably generating highly structured outputs, such as program code, mathematical formulas, or well-formed markup.
Constrained decoding approaches mitigate this problem by greedily restricting what tokens an LLM can output at each step to guarantee that the output matches a given constraint.
In this paper, we demonstrate that grammar-constrained decoding (GCD) techniques can distort the LLM's distribution, leading to outputs that are grammatical but whose likelihoods are not proportional to those given by the LLM.
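A tiny worked example of that distortion, with a made-up three-token model and a two-string "grammar"; the numbers are invented purely to show how per-step masking and renormalization shift probability mass.

```python
# Made-up next-token model over {"a", "b", "$"} ("$" ends the string).
P = {
    "":   {"a": 0.5, "b": 0.3, "$": 0.2},
    "a":  {"a": 0.1, "b": 0.6, "$": 0.3},
    "b":  {"a": 0.2, "b": 0.2, "$": 0.6},
    "aa": {"a": 0.1, "b": 0.1, "$": 0.8},
}
GRAMMAR = {"aa", "b"}  # the only grammatical complete strings

def extendable(prefix: str, tok: str) -> bool:
    """Can `prefix` followed by `tok` still reach a grammatical string?"""
    if tok == "$":
        return prefix in GRAMMAR
    return any(g.startswith(prefix + tok) for g in GRAMMAR)

def llm_prob(s: str) -> float:
    """Probability the unconstrained model assigns to the complete string."""
    p, prefix = 1.0, ""
    for ch in s:
        p, prefix = p * P[prefix][ch], prefix + ch
    return p * P[prefix]["$"]

def gcd_prob(s: str) -> float:
    """Probability under GCD: mask non-extendable tokens, renormalize."""
    p, prefix = 1.0, ""
    for ch in s + "$":
        allowed = {t: q for t, q in P[prefix].items() if extendable(prefix, t)}
        p *= allowed[ch] / sum(allowed.values())
        prefix += ch if ch != "$" else ""
    return p

for s in sorted(GRAMMAR):
    print(s, "LLM:", round(llm_prob(s), 3), "GCD:", round(gcd_prob(s), 3))
# The model's grammatical mass favours "b" (0.18 vs 0.04 for "aa"), but GCD
# outputs "aa" more often (0.625 vs 0.375) because masking rescues
# low-probability prefixes at later steps.
```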
arXiv Detail & Related papers (2024-05-31T17:39:15Z) - Constructing a BPE Tokenization DFA [0.0]
Many natural language processing systems operate over tokenizations of text to address the open-vocabulary problem.
We give and analyze an algorithm for the efficient construction of deterministic finite automata designed to operate directly on tokenizations produced by the popular byte pair encoding technique.
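The general idea can be sketched as follows, under assumptions that are not the paper's specific construction: take a character-level DFA and precompute, for every BPE token, where running its characters lands, yielding a token-level transition table that never re-detokenizes at query time.

```python
# Character-level DFA for the toy regular language (ab)*:
# state 0 is accepting and expects 'a'; state 1 expects 'b'.
CHAR_DELTA = {(0, "a"): 1, (1, "b"): 0}
START, ACCEPTING = 0, {0}

VOCAB = ["a", "b", "ab", "ba", "abab"]  # made-up BPE merges

def lift(state, token):
    """Run one whole BPE token through the character DFA (None = dead)."""
    for ch in token:
        state = CHAR_DELTA.get((state, ch))
        if state is None:
            return None
    return state

# Token-level transition table: one dictionary lookup per token afterwards.
TOKEN_DELTA = {(s, t): lift(s, t)
               for s in (0, 1) for t in VOCAB if lift(s, t) is not None}

def accepts(tokens):
    """Check a tokenization directly, without reassembling the character string."""
    state = START
    for t in tokens:
        state = TOKEN_DELTA.get((state, t))
        if state is None:
            return False
    return state in ACCEPTING

print(accepts(["ab", "abab"]))  # True:  "ababab" is in (ab)*
print(accepts(["a", "ab"]))     # False: "aab" is not
```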
arXiv Detail & Related papers (2024-05-13T11:59:24Z) - CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that inserts additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
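CodeIP's exact scheme is not reproduced here; the sketch below shows only the generic multi-bit idea it builds on: among the grammatically allowed tokens at each step (already filtered by a grammar guide, stubbed here), nudge the choice toward a context-keyed partition that encodes the next payload bit, so a detector with the same key can read the bits back. All names, token sets, and numbers are illustrative.

```python
import hashlib

def bit_of(token: str, context: str) -> int:
    """Context-keyed partition of the vocabulary into bit-0 / bit-1 tokens."""
    return hashlib.sha256((context + "|" + token).encode()).digest()[0] & 1

def embed_bits(steps, payload):
    """steps[i] is the list of grammatically allowed tokens at step i;
    pick one that carries the next payload bit whenever possible."""
    out, k = [], 0
    for allowed in steps:
        want = payload[k % len(payload)]
        carriers = [t for t in allowed if bit_of(t, "".join(out)) == want]
        # Real schemes add redundancy / error correction for steps where
        # no allowed token carries the wanted bit; here we just fall back.
        out.append(carriers[0] if carriers else allowed[0])
        k += 1
    return out

def extract_bits(tokens):
    """Detector side: recompute the partition and read one bit per token."""
    return [bit_of(t, "".join(tokens[:i])) for i, t in enumerate(tokens)]

# Made-up per-step sets of grammatically valid code tokens.
steps = [["int", "float"], ["x", "y"], ["=", "+="], ["0", "1"], [";", ";\n"]]
payload = [1, 0, 1]
code = embed_bits(steps, payload)
print(code, extract_bits(code))
```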
arXiv Detail & Related papers (2024-04-24T04:25:04Z) - Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
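A purely schematic sketch of that divide-and-conquer pattern follows, with plain lists of numbers standing in for chunk hidden states; the actual method operates on transformer activations and applies its own token-reduction criterion, neither of which is modelled here.

```python
def reduce_tokens(chunk, keep):
    """Token-reduction stand-in: keep the `keep` largest-magnitude entries."""
    return sorted(chunk, key=abs, reverse=True)[:keep]

def merge(left, right):
    """Merge two neighbouring chunk representations."""
    return left + right

def hierarchical_merge(chunks, keep=4):
    """Pairwise divide-and-conquer merging; reduction precedes every merge,
    so the working representation stays a bounded size at all levels."""
    level = [reduce_tokens(c, keep) for c in chunks]
    while len(level) > 1:
        nxt = [reduce_tokens(merge(level[i], level[i + 1]), keep)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # carry an unpaired chunk to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

# A "long input" split into five chunks of eight pseudo-token values each.
long_input = [[float(i * 8 + j) for j in range(8)] for i in range(5)]
print(hierarchical_merge(long_input))  # a single bounded-size representation
```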
arXiv Detail & Related papers (2024-04-16T06:34:08Z) - Constrained Decoding for Fill-in-the-Middle Code Language Models via Efficient Left and Right Quotienting of Context-Sensitive Grammars [11.279507894576213]
This paper contributes an incremental parsing approach that allows early rejection of syntactically incorrect code.
We extend the Earley parsing algorithm to allow for left and right quotients of context-free grammars.
arXiv Detail & Related papers (2024-02-28T02:12:47Z) - SeqXGPT: Sentence-Level AI-Generated Text Detection [62.3792779440284]
We introduce a sentence-level detection challenge by synthesizing documents polished with large language models (LLMs).
We then propose SeqXGPT (Sequence X (Check) GPT), a novel method that utilizes log probability lists from white-box LLMs as features for sentence-level AIGT detection.
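A minimal sketch of the feature-extraction step, with a deterministic stub in place of the white-box model's log probabilities and a placeholder comment where the trained classifier would go.

```python
def token_logprobs(sentence: str) -> list:
    """Stub for a white-box LLM: one log probability per whitespace token.
    A real system would score the sentence with the model's own tokenizer."""
    return [-(len(tok) % 5) - 1.0 for tok in sentence.split()]

def sentence_features(sentence: str) -> list:
    """Aggregate the per-token log-prob list into a fixed-size feature vector."""
    lps = token_logprobs(sentence)
    return [sum(lps) / len(lps), min(lps), max(lps), float(len(lps))]

doc = [
    "This sentence was written by a person, honestly.",
    "This sentence is suspiciously fluent and generic.",
]
features = [sentence_features(s) for s in doc]
# A trained sentence-level classifier would consume `features`; SeqXGPT
# itself feeds the log-prob sequences into a learned network rather than
# hand-made summary statistics.
print(features)
```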
arXiv Detail & Related papers (2023-10-13T07:18:53Z) - Grammar-Constrained Decoding for Structured NLP Tasks without Finetuning [27.59524153097858]
Grammar-constrained decoding (GCD) can be used to control the generation of large language models (LMs).
GCD can serve as a unified framework for structured NLP tasks in general.
We show that grammar-constrained LMs substantially outperform unconstrained LMs and can even beat task-specific finetuned models.
arXiv Detail & Related papers (2023-05-23T11:54:37Z) - Reducing Sequence Length by Predicting Edit Operations with Large Language Models [50.66922361766939]
This paper proposes predicting edit spans in the source text for local sequence transduction tasks.
We apply instruction tuning to Large Language Models using supervision data of edit spans.
Experiments show that the proposed method achieves comparable performance to the baseline in four tasks.
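A small sketch of what an edit-span target looks like, built with Python's standard difflib rather than anything from the paper; the spans are word-level (start, end, replacement) triples derived from a source/target pair.

```python
import difflib

def edit_spans(source: str, target: str):
    """Derive (start, end, replacement) spans that turn `source` into
    `target`; a stand-in for the supervision-data construction."""
    src_toks, tgt_toks = source.split(), target.split()
    sm = difflib.SequenceMatcher(a=src_toks, b=tgt_toks)
    spans = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op != "equal":
            spans.append((i1, i2, " ".join(tgt_toks[j1:j2])))
    return spans

def apply_spans(source: str, spans):
    """Apply spans right-to-left so earlier indices stay valid."""
    toks = source.split()
    for i1, i2, repl in sorted(spans, reverse=True):
        toks[i1:i2] = repl.split() if repl else []
    return " ".join(toks)

src = "She go to school yesterday ."
tgt = "She went to school yesterday ."
spans = edit_spans(src, tgt)
print(spans)                            # [(1, 2, 'went')]
print(apply_spans(src, spans) == tgt)   # True
```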
arXiv Detail & Related papers (2023-05-19T17:51:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.