Spelling-out is not Straightforward: LLMs' Capability of Tokenization from Token to Characters
- URL: http://arxiv.org/abs/2506.10641v1
- Date: Thu, 12 Jun 2025 12:27:41 GMT
- Title: Spelling-out is not Straightforward: LLMs' Capability of Tokenization from Token to Characters
- Authors: Tatsuya Hiraoka, Kentaro Inui
- Abstract summary: Large language models (LLMs) can spell out tokens character by character with high accuracy, yet they struggle with more complex character-level tasks. We investigate how LLMs internally represent and utilize character-level information during the spelling-out process.
- Score: 25.430820735194768
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models (LLMs) can spell out tokens character by character with high accuracy, yet they struggle with more complex character-level tasks, such as identifying compositional subcomponents within tokens. In this work, we investigate how LLMs internally represent and utilize character-level information during the spelling-out process. Our analysis reveals that, although spelling out is a simple task for humans, it is not handled in a straightforward manner by LLMs. Specifically, we show that the embedding layer does not fully encode character-level information, particularly beyond the first character. As a result, LLMs rely on intermediate and higher Transformer layers to reconstruct character-level knowledge, where we observe a distinct "breakthrough" in their spelling behavior. We validate this mechanism through three complementary analyses: probing classifiers, identification of knowledge neurons, and inspection of attention weights.
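As a rough illustration of the probing-classifier analysis mentioned in the abstract, the sketch below trains a linear probe to predict a token's k-th character from its static embedding. The model name (gpt2), the use of scikit-learn's LogisticRegression, and the probe design are illustrative assumptions, not the authors' exact setup, which also examines hidden states across Transformer layers, knowledge neurons, and attention weights.

```python
# Minimal sketch: can a linear probe recover the k-th character of a token
# from the embedding layer alone? (Model and probe choices are assumptions.)
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL_NAME = "gpt2"  # stand-in model; any LM with an accessible embedding matrix works
CHAR_POSITION = 1    # 0 = first character, 1 = second character, etc.

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
embeddings = model.get_input_embeddings().weight.detach()  # (vocab_size, hidden_dim)

# Build (embedding, k-th character) pairs from purely alphabetic tokens.
X, y = [], []
for token_id in range(len(tokenizer)):
    token = tokenizer.convert_ids_to_tokens(token_id).lstrip("Ġ")  # drop GPT-2 space marker
    if token.isalpha() and len(token) > CHAR_POSITION:
        X.append(embeddings[token_id].numpy())
        y.append(token[CHAR_POSITION].lower())

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Probe accuracy for character position {CHAR_POSITION}: {probe.score(X_test, y_test):.3f}")
```

Comparing probe accuracy at `CHAR_POSITION = 0` against later positions is one simple way to check the abstract's claim that the embedding layer encodes little character-level information beyond the first character.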
Related papers
- CharBench: Evaluating the Role of Tokenization in Character-Level Tasks [3.937454839700144]
CharBench is a benchmark of character-level tasks that is two orders of magnitude larger than existing alternatives. We present an analysis of how intrinsic properties of words and their segmentations into tokens correspond to model performance. For counting tasks, we find that tokenization properties are weakly correlated with correctness, while the length of the queried word and the actual character count play a more significant part.
arXiv Detail & Related papers (2025-08-04T16:46:15Z)
- Enhancing LLM Character-Level Manipulation via Divide and Conquer [74.55804812450164]
Large Language Models (LLMs) have demonstrated strong generalization capabilities across a wide range of natural language processing (NLP) tasks. However, they exhibit notable weaknesses in character-level string manipulation, struggling with fundamental operations such as character deletion, insertion, and substitution. We propose Character-Level Manipulation via Divide and Conquer, a novel approach designed to bridge the gap between token-level processing and character-level manipulation.
arXiv Detail & Related papers (2025-02-12T07:37:39Z)
- Enhancing Character-Level Understanding in LLMs through Token Internal Structure Learning [20.801571525710834]
Token Internal Position Awareness (TIPA) is a method that significantly improves models' ability to capture character positions within tokens. TIPA enhances position prediction accuracy in large language models, enabling more precise identification of target characters in original text.
arXiv Detail & Related papers (2024-11-26T18:44:39Z)
- Vulnerability of LLMs to Vertically Aligned Text Manipulations [108.6908427615402]
Vertical text input is commonly encountered in various real-world applications, such as mathematical computations and word-based Sudoku puzzles. Recent research demonstrates that modifying input formats, such as vertically aligning words for encoder-based models, can substantially lower accuracy in text classification tasks.
arXiv Detail & Related papers (2024-10-26T00:16:08Z)
- CUTE: Measuring LLMs' Understanding of Their Tokens [54.70665106141121]
Large Language Models (LLMs) show remarkable performance on a wide variety of tasks, even though most of them split text into multi-character tokens and process these tokens as atomic units without direct access to individual characters.
This raises the question: To what extent can LLMs learn orthographic information?
We propose CUTE, a new benchmark featuring a collection of tasks designed to test the orthographic knowledge of LLMs.
arXiv Detail & Related papers (2024-09-23T18:27:03Z)
- C-LLM: Learn to Check Chinese Spelling Errors Character by Character [61.53865964535705]
We propose C-LLM, a Large Language Model-based Chinese Spell Checking method that learns to check errors character by character.
C-LLM achieves an average improvement of 10% over existing methods.
arXiv Detail & Related papers (2024-06-24T11:16:31Z)
- CHIRON: Rich Character Representations in Long-Form Narratives [98.273323001781]
We propose CHIRON, a new 'character sheet'-based representation that organizes and filters textual information about characters. We validate CHIRON via the downstream task of masked-character prediction, where our experiments show CHIRON is better and more flexible than comparable summary-based baselines. Metrics derived from CHIRON can be used to automatically infer character-centricity in stories, and these metrics align with human judgments.
arXiv Detail & Related papers (2024-06-14T17:23:57Z)
- Identifying and Analyzing Performance-Critical Tokens in Large Language Models [52.404072802235234]
We study how large language models learn to perform tasks from demonstrations, deepening our understanding of the roles that different types of tokens play in these models.
arXiv Detail & Related papers (2024-01-20T20:55:21Z)
- What do tokens know about their characters and how do they know it? [3.8254443661593633]
We show that pre-trained language models that use subword tokenization schemes can succeed at a variety of language tasks that require character-level information.
We show that these models robustly encode character-level information and, in general, larger models perform better at the task.
arXiv Detail & Related papers (2022-06-06T13:27:26Z)
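The last entry above asks what token embeddings encode about their characters. Below is a minimal sketch of one such character-membership probe, predicting whether a chosen character occurs anywhere in a token from the token's embedding alone; the model (bert-base-uncased), the target character, and the logistic-regression probe are illustrative assumptions rather than that paper's exact experimental setup.

```python
# Rough sketch of a character-membership probe: does character c occur in this
# token, judging only from its static embedding? (Choices below are assumptions.)
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL_NAME = "bert-base-uncased"  # stand-in subword model
TARGET_CHAR = "k"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
emb = model.get_input_embeddings().weight.detach().numpy()  # (vocab_size, hidden_dim)

# Label each alphabetic token by whether it contains the target character.
X, y = [], []
for token, token_id in tokenizer.get_vocab().items():
    clean = token.replace("##", "")  # drop BERT's continuation marker
    if clean.isalpha():
        X.append(emb[token_id])
        y.append(int(TARGET_CHAR in clean))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"'{TARGET_CHAR}' membership probe accuracy: {probe.score(X_test, y_test):.3f}")
```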