How Do Language Models Acquire Character-Level Information?
- URL: http://arxiv.org/abs/2602.05347v1
- Date: Thu, 05 Feb 2026 06:19:51 GMT
- Title: How Do Language Models Acquire Character-Level Information?
- Authors: Soma Sato, Ryohei Sasano
- Abstract summary: We analyze how models acquire character-level knowledge by comparing LMs trained under controlled settings with those trained under standard settings. Our analysis reveals that merge rules and orthographic constraints constitute primary factors arising from tokenization.
- Score: 13.183615639007941
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models (LMs) have been reported to implicitly encode character-level information, despite such information not being explicitly provided during training. However, the mechanisms underlying this phenomenon remain largely unexplored. To reveal them, we analyze how models acquire character-level knowledge by comparing LMs trained under controlled settings, such as with a specified pre-training dataset or tokenizer, with those trained under standard settings. We categorize the contributing factors into those arising from tokenization and those independent of it. Our analysis reveals that merge rules and orthographic constraints constitute the primary factors arising from tokenization, whereas semantic associations of substrings and syntactic information function as key factors independent of tokenization.
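Below is a minimal, self-contained sketch (not the paper's method) of the standard probing setup used to demonstrate such implicit character-level knowledge: a linear probe is trained on token embeddings to predict a character-level property of each token's surface form. The toy vocabulary, embedding dimension, and random placeholder embeddings are all illustrative assumptions; with a real LM's embedding matrix, above-chance probe accuracy is the usual evidence.

```python
# Minimal probing sketch: can a linear probe read a character-level
# property (e.g., "contains the letter 'e'") out of token embeddings?
# The embeddings here are random placeholders standing in for a real
# LM's input embedding matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

vocab = ["the", "cat", "sat", "tree", "dog", "mouse", "house",
         "run", "blue", "green", "sky", "sun", "moon", "star", "fish"]
dim = 64

# Placeholder for a real model's input embedding table: one vector per token.
embeddings = rng.normal(size=(len(vocab), dim))

# Character-level label: does the token's surface form contain "e"?
labels = np.array([int("e" in tok) for tok in vocab])

X_tr, X_te, y_tr, y_te = train_test_split(
    embeddings, labels, test_size=0.33, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # ~chance on random vectors
```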
Related papers
- Excess Description Length of Learning Generalizable Predictors [7.527535569795127]
We develop a formal information-theoretic framework for quantifying how much predictive structure fine-tuning extracts from a training dataset. Our central quantity, Excess Description Length (EDL), is defined via prequential coding. We establish that EDL is non-negative in expectation, converges to surplus description length in the infinite-data limit, and provides bounds on expected generalization gain.
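As a rough illustration of prequential coding (a toy sketch in the spirit of the abstract, not the paper's exact EDL definition), each example is coded under the model fit to all previous examples, and the total is compared against coding everything with the final model:

```python
# Toy prequential-coding illustration with a Bernoulli "model": code each
# bit with the estimate formed from all earlier bits, then compare against
# the hindsight (final) model. The gap is an excess-description-length-style
# quantity; the paper's exact EDL definition may differ.
import numpy as np

rng = np.random.default_rng(1)
data = rng.random(200) < 0.7          # Bernoulli(0.7) bits to encode

def nll(p, x):
    """Code length in nats of outcome x under Bernoulli(p)."""
    p = min(max(p, 1e-6), 1 - 1e-6)   # clip to keep -log finite
    return -np.log(p if x else 1 - p)

# Prequential code length: predict x_t from x_1..x_{t-1} (Laplace smoothing).
prequential = 0.0
ones = 0
for t, x in enumerate(data):
    p_hat = (ones + 1) / (t + 2)      # online estimate before seeing x_t
    prequential += nll(p_hat, x)
    ones += int(x)

# Code length under the final (hindsight) model.
p_final = ones / len(data)
final = sum(nll(p_final, x) for x in data)

print(f"prequential: {prequential:.1f} nats, final: {final:.1f} nats")
print(f"excess description length (toy): {prequential - final:.1f} nats")
```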
arXiv Detail & Related papers (2026-01-08T08:46:42Z) - TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior [30.782240245074433]
Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). TokSuite is a collection of models and a benchmark that supports research into tokenization's influence on LMs.
arXiv Detail & Related papers (2025-12-23T20:43:06Z) - Tokenization Standards and Measurement in Natural Language Processing: A Comparative Analysis of Large Language Models on Turkish [0.29687381456163997]
This study introduces a novel evaluation framework addressing tokenization challenges specific to morphologically rich and low-resource languages such as Turkish. We assessed tokenizers based on vocabulary size, token count, processing time, language-specific token percentage (%TR), and token purity (%Pure). Our analysis reveals that the language-specific token percentage exhibits a stronger correlation with downstream performance (e.g., MMLU scores) than token purity.
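A hedged sketch of how such vocabulary metrics might be computed; the toy lexicon, the %TR definition (membership in a Turkish word list), and the %Pure definition (clean alphabetic tokens) are illustrative assumptions, not the paper's exact formulations:

```python
# Illustrative (not the paper's exact) computation of two vocabulary
# metrics: %TR, the share of tokens found in a Turkish word list, and
# %Pure, the share of tokens that are clean alphabetic units rather
# than fragments with digits or punctuation.
turkish_lexicon = {"ev", "kitap", "göz", "su", "gün", "yol"}  # toy lexicon

vocab = ["ev", "kitap", "##lar", "göz", "me", "the", "su", "x7@", "gün"]

def is_turkish(token: str) -> bool:
    return token.lstrip("#") in turkish_lexicon

def is_pure(token: str) -> bool:
    core = token.lstrip("#")          # ignore subword continuation marks
    return core.isalpha()

pct_tr = 100 * sum(map(is_turkish, vocab)) / len(vocab)
pct_pure = 100 * sum(map(is_pure, vocab)) / len(vocab)
print(f"%TR = {pct_tr:.1f}, %Pure = {pct_pure:.1f}")
```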
arXiv Detail & Related papers (2025-08-18T16:26:42Z) - Benchmarking Prosody Encoding in Discrete Speech Tokens [13.60092490447892]
Speech language models are expected to understand and generate responses that reflect not only semantic content but also prosodic features. This study benchmarks how well discrete speech tokens encode prosody, based on their sensitivity to artificially modified prosody, aiming to provide practical guidelines for designing discrete tokens.
arXiv Detail & Related papers (2025-08-15T05:11:16Z) - How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective [64.00022624183781]
Large language models (LLMs) can assess relevance and support information retrieval (IR) tasks. We investigate how different LLM modules contribute to relevance judgment through the lens of mechanistic interpretability.
arXiv Detail & Related papers (2025-04-10T16:14:55Z) - Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning [29.745218855471787]
Tokenization is a necessary component within the current architecture of many language models. We argue that tokenization is necessary for reasonably human-like language performance. We discuss implications for architectural choices, meaning construction, and the primacy of language for thought.
arXiv Detail & Related papers (2024-12-14T18:18:52Z) - Identifying Semantic Induction Heads to Understand In-Context Learning [103.00463655766066]
We investigate whether attention heads encode two types of relationships between tokens present in natural languages.
We find that certain attention heads exhibit a pattern where, when attending to head tokens, they recall tail tokens and increase the output logits of those tail tokens.
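A minimal sketch of the standard induction-head test consistent with that pattern: on a sequence whose first half repeats, score each head by the attention a query position pays to the token that followed the previous occurrence of its own token. The attention tensor below is a random placeholder for activations from a real forward pass:

```python
# Induction-head scoring sketch. On a repeated random sequence of block
# length T, the "induction target" for a query at position t in the second
# copy is position t - (T - 1): the token right after the same token's
# first occurrence. High mean attention there is induction-like behavior.
import numpy as np

rng = np.random.default_rng(2)
T = 16                                  # length of the repeated block
seq_len = 2 * T
n_heads = 8

# Placeholder for attn[head, query_pos, key_pos] from a real forward pass.
attn = rng.random((n_heads, seq_len, seq_len))
attn = np.tril(attn)                    # causal mask
attn /= attn.sum(-1, keepdims=True)     # normalize rows

queries = np.arange(T, seq_len)         # positions in the second copy
targets = queries - (T - 1)             # token after previous occurrence
induction_score = attn[:, queries, targets].mean(axis=1)

for h, s in enumerate(induction_score):
    print(f"head {h}: induction score {s:.3f}")  # high => induction-like
```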
arXiv Detail & Related papers (2024-02-20T14:43:39Z) - Identifying and Analyzing Performance-Critical Tokens in Large Language Models [52.404072802235234]
We study how large language models learn to perform tasks from demonstrations. Our work sheds light on this process and deepens our understanding of the roles different types of tokens play in large language models.
arXiv Detail & Related papers (2024-01-20T20:55:21Z) - Improving Input-label Mapping with Demonstration Replay for In-context Learning [67.57288926736923]
In-context learning (ICL) is an emerging capability of large autoregressive language models.
We propose a novel ICL method called Repeated Demonstration with Sliding Causal Attention (RdSca).
We show that our method significantly improves the input-label mapping in ICL demonstrations.
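A toy sketch of what such a customized causal mask could look like; the specific blocking rule below (the replayed demonstration copy cannot attend to its original occurrence) is an assumption about the design for illustration, not a verified detail of RdSca:

```python
# Toy attention mask in the spirit of demonstration replay with a
# customized causal mask. Layout: [demo original | demo replay | query].
# As an assumption, the replayed copy is blocked from attending to the
# original so the model must re-encode the demonstration rather than
# trivially copying it.
import numpy as np

demo_len, query_len = 4, 2
L = demo_len + demo_len + query_len

mask = np.tril(np.ones((L, L), dtype=bool))        # ordinary causal mask

replay = slice(demo_len, 2 * demo_len)             # replayed demo positions
original = slice(0, demo_len)                      # original demo positions
mask[replay, original] = False                     # replay can't see original

print(mask.astype(int))  # 1 = attention allowed, 0 = blocked
```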
arXiv Detail & Related papers (2023-10-30T14:29:41Z) - Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
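A small sketch of the invariance idea behind this kind of self-supervised evaluation: transform your own text in a controlled way and measure how far the model's output distribution moves. The stub model and the total-variation sensitivity metric below are illustrative assumptions, not the paper's exact procedure:

```python
# Self-supervised sensitivity sketch: apply a transformation to the input
# and compare the model's output distributions. The model here is a stub
# returning deterministic pseudo-probabilities; any real LM's softmax
# outputs could be dropped in.
import numpy as np

def model_probs(text: str) -> np.ndarray:
    """Stub LM: deterministic pseudo-probabilities from a text hash."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    logits = rng.normal(size=100)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sensitivity(text: str, transform) -> float:
    """Total-variation distance between outputs on text vs. transform(text)."""
    p, q = model_probs(text), model_probs(transform(text))
    return 0.5 * float(np.abs(p - q).sum())

uppercase = str.upper                 # one simple self-supervised transform
print(sensitivity("The quick brown fox jumps over the lazy dog.", uppercase))
```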
arXiv Detail & Related papers (2023-06-23T17:59:09Z) - A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis [128.0532113800092]
We present a mechanistic interpretation of Transformer-based LMs on arithmetic questions.
This provides insights into how information related to arithmetic is processed by LMs.
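A generic activation-patching sketch of the kind used in causal mediation analysis, with a stub model so it runs (this is not the paper's setup): compare the answer score on a corrupted arithmetic prompt with and without splicing in one hidden activation from the clean run:

```python
# Activation-patching recipe: the indirect effect of a component is how
# much the corrupted run's answer score recovers when that component's
# activation is replaced by its value from the clean run. The one-layer
# "model" and random prompt vectors below are placeholders; only the
# experimental logic matters.
import numpy as np

rng = np.random.default_rng(3)
W_in, W_out = rng.normal(size=(8, 16)), rng.normal(size=(16,))

def run(prompt_vec, patch=None):
    """Stub forward pass: one hidden layer; optionally patch it."""
    hidden = np.tanh(prompt_vec @ W_in)
    if patch is not None:
        hidden = patch                 # splice in activation from clean run
    return float(hidden @ W_out)       # scalar "correct-answer score"

clean = rng.normal(size=8)             # stands for "12 + 7 ="
corrupted = rng.normal(size=8)         # stands for "12 + 9 ="

clean_hidden = np.tanh(clean @ W_in)   # activation to transplant

base = run(corrupted)
patched = run(corrupted, patch=clean_hidden)
print(f"indirect effect of this layer: {patched - base:+.3f}")
```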
arXiv Detail & Related papers (2023-05-24T11:43:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.