Related papers: SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding

URL: http://arxiv.org/abs/2601.09089v1
Date: Wed, 14 Jan 2026 02:45:08 GMT
Title: SubTokenTest: A Practical Benchmark for Real-World Sub-token Understanding
Authors: Shuyang Hou, Yi Hu, Muhan Zhang
Abstract summary: We introduce SubTokenTest, a benchmark that assesses sub-token understanding through practical, utility-driven tasks. Our benchmark includes ten tasks across four domains and isolates tokenization-related failures by decoupling performance from complex reasoning.
Score: 40.45653552579818
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in large language models (LLMs) have significantly enhanced their reasoning capabilities. However, they continue to struggle with basic character-level tasks, such as counting letters in words, a problem rooted in their tokenization process. While existing benchmarks have highlighted this weakness through basic character operations, such failures are often dismissed due to lacking practical relevance. Yet, many real-world applications, such as navigating text-based maps or interpreting structured tables, rely heavily on precise sub-token understanding. In this regard, we introduce SubTokenTest, a comprehensive benchmark that assesses sub-token understanding through practical, utility-driven tasks. Our benchmark includes ten tasks across four domains and isolates tokenization-related failures by decoupling performance from complex reasoning. We provide a comprehensive evaluation of nine advanced LLMs. Additionally, we investigate the impact of test-time scaling on sub-token reasoning and explore how character-level information is encoded within the hidden states.

Related papers

FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution [3.4666771782038652]
Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency. We introduce FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens. We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning.
arXiv Detail & Related papers (2025-10-18T10:22:13Z)
ELAIPBench: A Benchmark for Expert-Level Artificial Intelligence Paper Understanding [49.67493845115009]
ELAIPBench is a benchmark curated by domain experts to evaluate large language models' comprehension of AI research papers. It spans three difficulty levels and emphasizes non-trivial reasoning rather than shallow retrieval. Experiments show that the best-performing LLM achieves an accuracy of only 39.95%, far below human performance.
arXiv Detail & Related papers (2025-10-12T11:11:20Z)
CharBench: Evaluating the Role of Tokenization in Character-Level Tasks [3.937454839700144]
CharBench is a benchmark of character-level tasks that is two orders of magnitude larger than existing alternatives. We present an analysis of how intrinsic properties of words and their segmentations into tokens correspond to model performance. For counting tasks, we find that tokenization properties are weakly correlated with correctness, while the length of the queried word and the actual character count play a more significant part.
arXiv Detail & Related papers (2025-08-04T16:46:15Z)
Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation [1.7268889851975326]
We introduce WebApp1K, a novel benchmark for evaluating large language models (LLMs) in test-driven development (TDD) tasks. Unlike traditional approaches relying on natural language prompts, our benchmark emphasizes the ability of LLMs to interpret and implement functionality directly from test cases.
arXiv Detail & Related papers (2025-05-13T23:47:12Z)
Enhancing LLM Character-Level Manipulation via Divide and Conquer [74.55804812450164]
Large Language Models (LLMs) have demonstrated strong generalization capabilities across a wide range of natural language processing (NLP) tasks. They exhibit notable weaknesses in character-level string manipulation, struggling with fundamental operations such as character deletion, insertion, and substitution. We propose Character-Level Manipulation via Divide and Conquer, a novel approach designed to bridge the gap between token-level processing and character-level manipulation.
arXiv Detail & Related papers (2025-02-12T07:37:39Z)
Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [53.57895922042783]
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data. We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z)
TaskEval: Assessing Difficulty of Code Generation Tasks for Large Language Models [7.3673614578648285]
Large Language Models (LLMs) excel in code-related tasks like code generation, but benchmark evaluations often overlook task characteristics, such as difficulty. This paper introduces a framework using diverse prompts and Item Response Theory (IRT) to efficiently assess LLMs' capabilities and benchmark task characteristics.
arXiv Detail & Related papers (2024-07-30T22:31:19Z)
H-STAR: LLM-driven Hybrid SQL-Text Adaptive Reasoning on Tables [56.73919743039263]
This paper introduces a novel algorithm that integrates both symbolic and semantic (textual) approaches in a two-stage process to address limitations. Our experiments demonstrate that H-STAR significantly outperforms state-of-the-art methods across three question-answering (QA) and fact-verification datasets.
arXiv Detail & Related papers (2024-06-29T21:24:19Z)
Identifying and Analyzing Performance-Critical Tokens in Large Language Models [52.404072802235234]
We study how large language models learn to perform tasks from demonstrations. Our work sheds light on how large language models learn to perform tasks from demonstrations and deepens our understanding of the roles different types of tokens play in large language models.
arXiv Detail & Related papers (2024-01-20T20:55:21Z)
ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult. We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.