Related papers: TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar

URL: http://arxiv.org/abs/2510.14972v1
Date: Thu, 16 Oct 2025 17:59:45 GMT
Title: TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar
Authors: Yinxi Li, Yuntian Deng, Pengyu Nie
Abstract summary: We show that semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. We introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation.
Score: 8.34539885321864
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.
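
Illustration (not from the paper): a minimal Python sketch of the misalignment the abstract describes. It assumes a hypothetical toy vocabulary (TOY_VOCAB) and uses greedy longest-match segmentation as a stand-in for a learned BPE tokenizer; neither is TokDrift's actual tooling or any model's real vocabulary. Python's standard tokenize module supplies the grammar-level tokens for comparison.

```python
import io
import tokenize

# Hypothetical toy vocabulary; real BPE vocabularies are learned from data.
TOY_VOCAB = {"x", "=", "a", "+", "b", " ", "=a", "+b", " =", " a", " +", " b"}


def toy_subword_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation (a stand-in for BPE, not BPE itself)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


def grammar_tokens(code):
    """Lexical tokens as seen by Python's own tokenizer, ignoring layout."""
    skip = {
        tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER,
        tokenize.INDENT, tokenize.DEDENT, tokenize.COMMENT,
    }
    stream = tokenize.generate_tokens(io.StringIO(code).readline)
    return [tok.string for tok in stream if tok.type not in skip]


original = "x=a+b"
variant = "x = a + b"  # whitespace-only rewrite; same program

print(grammar_tokens(original) == grammar_tokens(variant))
print(toy_subword_tokenize(original))
print(toy_subword_tokenize(variant))
```

Under these assumptions, both snippets yield the same grammar tokens, while the toy subword segmentation of the first merges an operator with its operand ("=a", "+b") and the second does not. This is the kind of tokenization-only difference the abstract refers to.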

Related papers

Say Anything but This: When Tokenizer Betrays Reasoning in LLMs [0.7162422068114824]
Large language models (LLMs) reason over discrete token ID sequences. Modern subword tokenizers routinely produce non-unique encodings. We show that tokenization can betray LLM reasoning through one-to-many token ID mappings.
arXiv Detail & Related papers (2026-01-21T05:09:09Z)
Token Sugar: Making Source Code Sweeter for LLMs through Token-Efficient Shorthand [12.853934439806908]
We propose Token Sugar, a concept that replaces frequent and verbose code patterns with reversible, token-efficient shorthand in the source code. With this solution, we obtain 799 (code pattern, shorthand) pairs, which can reduce the token count of the source code by up to 15.1%. Experimental results show that these models not only achieve significant token savings (up to 11.2% reduction) during generation but also maintain near-identical Pass@1 scores compared to baselines trained on unprocessed code.
arXiv Detail & Related papers (2025-12-09T05:42:23Z)
Language Confusion Gate: Language-Aware Decoding Through Model Self-Distillation [50.93756215410832]
This paper introduces the Language Confusion Gate (LCG), a lightweight, plug-in solution that filters tokens during decoding. The LCG is trained using norm-adjusted self-distillation to predict appropriate language families and apply masking only when needed.
arXiv Detail & Related papers (2025-10-20T14:02:37Z)
IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs. The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
Sampling from Your Language Model One Byte at a Time [82.71473348639489]
Tokenization can introduce distortion into the model's generations, known as the Prompt Boundary Problem (PBP). We present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers.
arXiv Detail & Related papers (2025-06-17T02:37:04Z)
Type-Constrained Code Generation with Language Models [51.03439021895432]
We introduce a type-constrained decoding approach that leverages type systems to guide code generation. For this purpose, we develop novel prefix automata and a search over inhabitable types, forming a sound approach to enforce well-typedness on LLM-generated code. Our approach reduces compilation errors by more than half and significantly increases functional correctness in code synthesis, translation, and repair tasks.
arXiv Detail & Related papers (2025-04-12T15:03:00Z)
Tokenization is Sensitive to Language Variation [14.568179478275255]
Tokenizers split texts into smaller units and might behave differently for less common linguistic forms. This might affect downstream LLM performance differently on two types of tasks. We find that the best tokenizer varies on the two task types and that the pre-tokenizer has the biggest overall impact on performance.
arXiv Detail & Related papers (2025-02-21T09:58:54Z)
Tokenization as Finite-State Transduction [24.19959327497118]
We introduce a finite-state framework which can efficiently encode all possible tokenizations of a regular language. We show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece) fit within this framework. An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern.
arXiv Detail & Related papers (2024-10-21T07:10:07Z)
CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation. CodeIP is a novel multi-bit watermarking technique that inserts additional information to preserve provenance details. Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
Mutation-based Consistency Testing for Evaluating the Code Understanding Capability of LLMs [5.549095839198671]
Large Language Models (LLMs) have shown remarkable capabilities in processing both natural and programming languages. We propose a novel method to assess the code understanding performance of LLMs, particularly focusing on subtle differences between code and its descriptions. We apply different types of code mutations, such as operator replacement and statement deletion, to generate inconsistent code-description pairs. We conduct a case study on the two popular LLMs, GPT-3.5 and GPT-4, using the state-of-the-art code generation benchmark, HumanEval-X.
arXiv Detail & Related papers (2024-01-11T14:27:43Z)
Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for Code Generation [22.219645213202178]
This paper proposes the "Semantic Chain-of-Thought" approach, named SeCoT, to introduce semantic information of code. We show that SeCoT achieves state-of-the-art performance, greatly improving the potential of large models for code generation.
arXiv Detail & Related papers (2023-10-16T05:09:58Z)
Integrating Language Guidance into Vision-based Deep Metric Learning [78.18860829585182]
We propose to learn metric spaces which encode semantic similarities as embedding space distances. These spaces should be transferable to classes beyond those seen during training. This causes learned embedding spaces to encode incomplete semantic context and misrepresent the semantic relation between classes.
arXiv Detail & Related papers (2022-03-16T11:06:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.