Tokenization Constraints in LLMs: A Study of Symbolic and Arithmetic Reasoning Limits
- URL: http://arxiv.org/abs/2505.14178v1
- Date: Tue, 20 May 2025 10:32:30 GMT
- Title: Tokenization Constraints in LLMs: A Study of Symbolic and Arithmetic Reasoning Limits
- Authors: Xiang Zhang, Juntai Cao, Jiaqi Wei, Yiwei Xu, Chenyu You
- Abstract summary: Tokenization is the first - and often underappreciated - layer of computation in language models. We show that the success of Chain-of-Thought (CoT) reasoning is fundamentally bounded by the structure of tokenized inputs.
- Score: 15.941209553757274
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tokenization is the first - and often underappreciated - layer of computation in language models. While Chain-of-Thought (CoT) prompting enables transformer models to approximate recurrent computation by externalizing intermediate steps, we show that the success of such reasoning is fundamentally bounded by the structure of tokenized inputs. This work presents a theoretical and empirical investigation into how tokenization schemes, particularly subword-based methods like byte-pair encoding (BPE), impede symbolic computation by merging or obscuring atomic reasoning units. We introduce the notion of Token Awareness to formalize how poor token granularity disrupts logical alignment and prevents models from generalizing symbolic procedures. Through systematic evaluation on arithmetic and symbolic tasks, we demonstrate that token structure dramatically affects reasoning performance, causing failure even with CoT, while atomically-aligned formats unlock strong generalization, allowing small models (e.g., GPT-4o-mini) to outperform larger systems (e.g., o1) in structured reasoning. Our findings reveal that symbolic reasoning ability in LLMs is not purely architectural, but deeply conditioned on token-level representations.
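To make the granularity effect concrete, below is a minimal sketch (not from the paper; it assumes the third-party tiktoken library and its o200k_base encoding, used by GPT-4o-class models) contrasting how BPE merges digit runs in a raw arithmetic string against an atomically-aligned, digit-separated format:

```python
# Minimal sketch (assumes the tiktoken library and its "o200k_base"
# encoding): BPE merges digit runs into multi-digit tokens, while an
# atomically-aligned format keeps one digit per token.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

raw = "12345 + 67890"              # digits merged by BPE into subword chunks
aligned = "1 2 3 4 5 + 6 7 8 9 0"  # atomically-aligned: one digit per token

for text in (raw, aligned):
    # Decode each token id individually to inspect token boundaries.
    pieces = [enc.decode([tid]) for tid in enc.encode(text)]
    print(f"{text!r} -> {pieces}")

# Illustrative output (exact merges depend on the encoding's learned vocabulary):
# '12345 + 67890'         -> ['123', '45', ' +', ' ', '678', '90']
# '1 2 3 4 5 + 6 7 8 9 0' -> ['1', ' 2', ' 3', ' 4', ' 5', ' +', ' 6', ' 7', ' 8', ' 9', ' 0']
```

Under such an encoding, the raw operands surface as multi-digit chunks, whereas the spaced format yields one digit per token - the kind of atomic alignment the abstract credits with unlocking generalization.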
Related papers
- PixelThink: Towards Efficient Chain-of-Pixel Reasoning [70.32510083790069]
PixelThink is a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty. It learns to compress reasoning length in accordance with scene complexity and predictive confidence. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance.
arXiv Detail & Related papers (2025-05-29T17:55:49Z)
- Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective [59.7140089198992]
We develop a mathematical framework that defines abstract reasoning as the ability to extract essential patterns. We introduce two novel complementary metrics: $\Gamma$ measures basic reasoning accuracy, while $\Delta$ quantifies a model's reliance on specific symbols.
arXiv Detail & Related papers (2025-05-28T09:02:45Z)
- Enhancing Logical Reasoning in Language Models via Symbolically-Guided Monte Carlo Process Supervision [38.592071445554836]
Large language models (LLMs) have shown promising performance on mathematical and logical reasoning benchmarks. However, LLMs are susceptible to content variations, demonstrating a lack of robust symbolic abstractions supporting their reasoning process. Existing approaches fail to effectively leverage symbolic representations due to the challenges involved in developing reliable and scalable verification mechanisms.
arXiv Detail & Related papers (2025-05-26T18:06:39Z)
- I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [76.15163242945813]
Large language models (LLMs) have led many to conclude that they exhibit a form of intelligence. We introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables.
arXiv Detail & Related papers (2025-03-12T01:21:17Z)
- Improving Chain-of-Thought Reasoning via Quasi-Symbolic Abstractions [45.950841507164064]
Chain-of-Thought (CoT) prompting represents a common strategy for reasoning in Large Language Models. We present QuaSAR, a variation of CoT that guides LLMs to operate at a higher level of abstraction via quasi-symbolic explanations. Our experiments show that quasi-symbolic abstractions can improve CoT-based methods by up to 8% in accuracy.
arXiv Detail & Related papers (2025-02-18T07:58:48Z)
- Efficient Reasoning with Hidden Thinking [48.96945580741641]
Chain-of-Thought (CoT) reasoning has become a powerful framework for improving complex problem-solving capabilities. We propose $\textbf{Heima}$ (as hidden llama), an efficient reasoning framework that leverages reasoning CoTs in hidden latent space. The Heima model achieves higher generation efficiency while maintaining, or even improving, zero-shot task accuracy.
arXiv Detail & Related papers (2025-01-31T15:10:29Z)
- On the Diagram of Thought [12.304069891580658]
Current large language models (LLMs) demonstrate impressive capabilities but struggle with complex, multi-step reasoning tasks. We introduce the Diagram of Thought (DoT) as a framework wherein a single auto-regressive LLM internally constructs and navigates a Directed Acyclic Graph (DAG). We formalize the reasoning DAG as a diagram within a suitable topos and prove that the final step, aggregating validated information, corresponds semantically to computing the colimit of the relevant sub-diagram.
arXiv Detail & Related papers (2024-09-16T07:01:41Z)
- The Foundations of Tokenization: Statistical and Computational Concerns [51.370165245628975]
Tokenization is a critical step in the NLP pipeline. Despite its recognized importance as a standard representation method in NLP, the theoretical underpinnings of tokenization are not yet fully understood. The present paper contributes to addressing this theoretical gap by proposing a unified formal framework for representing and analyzing tokenizer models.
arXiv Detail & Related papers (2024-07-16T11:12:28Z)
- LOGICSEG: Parsing Visual Semantics with Neural Logic Learning and Reasoning [73.98142349171552]
LOGICSEG is a holistic visual semantic parser that integrates neural inductive learning and logic reasoning with both rich data and symbolic knowledge.
During fuzzy logic-based continuous relaxation, logical formulae are grounded onto data and neural computational graphs, hence enabling logic-induced network training.
These designs together make LOGICSEG a general and compact neural-logic machine that is readily integrated into existing segmentation models.
arXiv Detail & Related papers (2023-09-24T05:43:19Z)
- Evaluating Step-by-Step Reasoning through Symbolic Verification [20.156768135017007]
Pre-trained language models (LMs) have shown remarkable reasoning performance for in-context learning.
The proposed LMLP enjoys more than 25% higher accuracy than chain-of-thought (CoT) prompting on length-generalization benchmarks, even with smaller model sizes.
arXiv Detail & Related papers (2022-12-16T19:30:01Z)