No Clustering, No Routing: How Transformers Actually Process Rare Tokens
- URL: http://arxiv.org/abs/2509.04479v1
- Date: Sat, 30 Aug 2025 22:20:41 GMT
- Title: No Clustering, No Routing: How Transformers Actually Process Rare Tokens
- Authors: Jing Liu
- Abstract summary: Large language models struggle with rare token prediction, yet the mechanisms driving their specialization remain unclear. We investigate this through neuron influence analyses, graph-based clustering, and attention head ablations in GPT-2 XL and Pythia models.
- Score: 6.581088182267414
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models struggle with rare token prediction, yet the mechanisms driving their specialization remain unclear. Prior work identified specialized "plateau" neurons for rare tokens following distinctive three-regime influence patterns [liu2025emergent], but their functional organization is unknown. We investigate this through neuron influence analyses, graph-based clustering, and attention head ablations in GPT-2 XL and Pythia models. Our findings show that: (1) rare token processing requires additional plateau neurons beyond the power-law regime sufficient for common tokens, forming dual computational regimes; (2) plateau neurons are spatially distributed rather than forming modular clusters; and (3) attention mechanisms exhibit no preferential routing to specialists. These results demonstrate that rare token specialization arises through distributed, training-driven differentiation rather than architectural modularity, preserving context-sensitive flexibility while achieving adaptive capacity allocation.
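As a concrete illustration of the influence analysis described above, the following is a minimal, hedged sketch: it zero-ablates a single MLP neuron and measures the drop in a rare token's log-probability. GPT-2 small stands in for GPT-2 XL, and the prompt, target token, and neuron coordinates are illustrative assumptions, not values from the paper, whose exact influence metric may differ.

```python
# Sketch only: zero-ablation influence of one MLP neuron on a rare-token
# log-probability. The prompt, target token, and neuron coordinates below
# are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("The zoologist pointed at the", return_tensors="pt").input_ids
target = tok.encode(" aardvark")[0]  # first subtoken of a rare word

def target_logprob() -> float:
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[target].item()

baseline = target_logprob()

layer, neuron = 5, 123  # hypothetical coordinates of a candidate neuron

def zero_neuron(module, inputs, output):
    output[..., neuron] = 0.0  # ablate one pre-activation channel
    return output

hook = model.transformer.h[layer].mlp.c_fc.register_forward_hook(zero_neuron)
influence = baseline - target_logprob()  # positive = neuron helped the rare token
hook.remove()
print(f"influence of neuron ({layer}, {neuron}): {influence:+.4f}")
```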
Related papers
- Distributed Specialization: Rare-Token Neurons in Large Language Models [8.13000021263958]
Large language models (LLMs) struggle with representing and generating rare tokens despite their importance in specialized domains. We investigate whether LLMs develop internal specialization mechanisms through discrete modular architectures or through distributed, parameter-level differentiation.
arXiv Detail & Related papers (2025-09-25T13:49:38Z) - Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models [68.57424628540907]
Large language models (LLMs) often develop learned mechanisms specialized to specific datasets. We introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms. Our method employs Integrated Gradients to quantify each neuron's influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance.
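The summary names Integrated Gradients as the attribution tool; below is a minimal sketch of IG computed over hidden-neuron activations on a toy classifier. The model, baseline, and step count are assumptions for illustration, not the paper's setup.

```python
# Integrated Gradients over hidden-neuron activations (toy sketch).
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
encoder, head = net[:2], net[2:]   # split the network around the hidden layer
x = torch.randn(1, 8)

h = encoder(x).detach()            # hidden-neuron activations to attribute
baseline = torch.zeros_like(h)     # IG baseline: silenced neurons
pred = head(h).argmax().item()     # the high-confidence class being explained

steps = 32
ig = torch.zeros_like(h)
for k in range(1, steps + 1):      # integrate gradients along the path
    h_k = (baseline + k / steps * (h - baseline)).requires_grad_(True)
    score = head(h_k).log_softmax(-1)[0, pred]
    score.backward()
    ig += h_k.grad / steps
ig *= (h - baseline)               # completeness-scaled per-neuron attribution
print(ig.squeeze())                # large-magnitude entries = candidate neurons
```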
arXiv Detail & Related papers (2025-07-12T08:10:10Z) - Dreaming up scale invariance via inverse renormalization group [0.0]
We show how minimal neural networks can invert the renormalization group (RG) coarse-graining procedure in the two-dimensional Ising model. We demonstrate that even neural networks with as few as three trainable parameters can learn to generate critical configurations.
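As a rough illustration of what a three-parameter inverse-RG generator could look like (an assumption-laden toy of our own, not the paper's architecture): each fine spin is sampled from a probability set by its coarse parent spin, the parent's neighbor field, and a bias.

```python
# Toy three-parameter stochastic upsampler for +/-1 Ising spins (sketch only).
import torch

torch.manual_seed(0)
w1, w2, b = (torch.randn(()) * 0.1 for _ in range(3))  # the three parameters
coarse = torch.randint(0, 2, (8, 8)).float() * 2 - 1   # +/-1 coarse spins

# Each coarse spin becomes the "parent" of a 2x2 block of fine spins.
parent = coarse.repeat_interleave(2, 0).repeat_interleave(2, 1)
# Local field: sum of the four coarse neighbors, upsampled the same way.
nbrs = sum(torch.roll(coarse, s, d) for d in (0, 1) for s in (1, -1))
field = nbrs.repeat_interleave(2, 0).repeat_interleave(2, 1)

p_up = torch.sigmoid(w1 * parent + w2 * field + b)     # P(fine spin = +1)
fine = (torch.rand(16, 16) < p_up).float() * 2 - 1     # sampled fine lattice
print(fine.mean())                                      # magnetization check
```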
arXiv Detail & Related papers (2025-06-04T14:46:22Z) - Emergent Specialization: Rare Token Neurons in Language Models [5.946977198458224]
Large language models struggle with representing and generating rare tokens despite their importance in specialized domains. In this study, we identify neuron structures with exceptionally strong influence on language models' prediction of rare tokens, termed rare token neurons.
arXiv Detail & Related papers (2025-05-19T08:05:13Z) - Benign Overfitting in Token Selection of Attention Mechanism [34.316270145027616]
We study the training dynamics and generalization ability of the attention mechanism in classification problems with label noise. We show that, under a characterization of the signal-to-noise ratio (SNR), the token selection of the attention mechanism achieves benign overfitting. Our work also demonstrates an interesting delayed acquisition of generalization after an initial phase of overfitting.
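A toy illustration of SNR-dependent token selection (our construction, not the paper's model): a query attends over one signal token and several noise tokens, and raising the SNR drives the softmax weight on the signal token toward 1.

```python
# Softmax token selection vs. SNR (toy demo).
import torch

torch.manual_seed(0)
n_noise = 7

def signal_attention(snr: float) -> float:
    scores = torch.randn(n_noise + 1)  # query-key scores of noise tokens
    scores[0] = snr                    # the signal token's score
    return torch.softmax(scores, dim=0)[0].item()

for snr in (0.5, 2.0, 4.0, 8.0):       # higher SNR -> attention locks on signal
    print(f"SNR={snr:>4}: weight on signal token = {signal_attention(snr):.3f}")
```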
arXiv Detail & Related papers (2024-09-26T08:20:05Z) - Coding schemes in neural networks learning classification tasks [52.22978725954347]
We investigate fully-connected, wide neural networks learning classification tasks.
We show that the networks acquire strong, data-dependent features.
Surprisingly, the nature of the internal representations depends crucially on the neuronal nonlinearity.
arXiv Detail & Related papers (2024-06-24T14:50:05Z) - Confidence Regulation Neurons in Language Models [91.90337752432075]
This study investigates the mechanisms by which large language models represent and regulate uncertainty in next-token predictions.
Entropy neurons are characterized by an unusually high weight norm and influence the final layer normalization (LayerNorm) scale to effectively scale down the logits.
Token frequency neurons, which we describe here for the first time, boost or suppress each token's logit proportionally to its log frequency, thereby shifting the output distribution towards or away from the unigram distribution.
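The described effect is easy to state concretely. Below is a hedged toy (our construction, with a made-up vocabulary and unigram distribution) showing a single neuron pulling the output distribution toward the unigram distribution as its activation grows.

```python
# A "token frequency neuron" as a log-frequency bias on the logits (toy sketch).
import torch

torch.manual_seed(0)
vocab = 1000
unigram = torch.distributions.Dirichlet(torch.ones(vocab)).sample()
log_freq = unigram.log()
logits = torch.randn(vocab)             # model logits before the neuron acts

def kl_to_unigram(activation: float) -> float:
    # The neuron adds activation * log-frequency to every token's logit.
    p = torch.softmax(logits + activation * log_freq, dim=0)
    return torch.sum(p * (p.log() - log_freq)).item()

for a in (0.0, 0.5, 1.0):               # stronger activation -> closer to unigram
    print(f"activation={a}: KL(p || unigram) = {kl_to_unigram(a):.3f}")
```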
arXiv Detail & Related papers (2024-06-24T01:31:03Z) - TANGOS: Regularizing Tabular Neural Networks through Gradient Orthogonalization and Specialization [69.80141512683254]
We introduce Tabular Neural Gradient Orthogonalization and Specialization (TANGOS).
TANGOS is a novel framework for regularization in the tabular setting built on latent unit attributions.
We demonstrate that our approach can lead to improved out-of-sample generalization performance, outperforming other popular regularization methods.
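Reading from the summary, a minimal sketch of the two ingredients (attribution orthogonalization across latent units, plus a specialization term) might look like the following; the attribution choice and penalty weights are assumptions, not the paper's.

```python
# Sketch of a TANGOS-style regularizer over latent unit attributions.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(10, 8), nn.ReLU())  # encoder: 8 latent units
x = torch.randn(4, 10, requires_grad=True)
z = net(x)

# Attribution of each latent unit: gradient of the unit w.r.t. the inputs
# (create_graph=True so the penalty itself can be backpropagated).
attrs = torch.stack([
    torch.autograd.grad(z[:, j].sum(), x, retain_graph=True, create_graph=True)[0]
    for j in range(z.shape[1])
])                                    # (units, batch, features)

a = attrs.flatten(1)
a = a / (a.norm(dim=1, keepdim=True) + 1e-8)
cos = a @ a.T                         # pairwise attribution similarity
off = cos - torch.diag(torch.diag(cos))

loss_orth = off.abs().mean()          # orthogonalization: decorrelate units
loss_spec = attrs.abs().mean()        # specialization: sparse attributions
reg = loss_orth + 0.1 * loss_spec     # added to the task loss (weight assumed)
print(f"orthogonalization={loss_orth:.4f}, specialization={loss_spec:.4f}")
```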
arXiv Detail & Related papers (2023-03-09T18:57:13Z) - On the Interpretability of Regularisation for Neural Networks Through Model Gradient Similarity [0.0]
Model Gradient Similarity (MGS) serves as a metric of regularisation.
MGS provides the basis for a new regularisation scheme which exhibits excellent performance.
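As a hedged sketch of the metric itself (not the authors' implementation), an MGS-style similarity between per-sample loss gradients can be computed as follows.

```python
# Cosine similarity between per-sample loss gradients (MGS-style, sketch only).
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(5, 1)
loss_fn = nn.MSELoss()
xs, ys = torch.randn(2, 5), torch.randn(2, 1)

grads = []
for i in range(2):                       # one full gradient per sample
    model.zero_grad()
    loss_fn(model(xs[i:i+1]), ys[i:i+1]).backward()
    grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))

mgs = torch.cosine_similarity(grads[0], grads[1], dim=0)
print(f"gradient similarity between the two samples: {mgs:.3f}")
```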
arXiv Detail & Related papers (2022-05-25T10:38:33Z) - Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z) - Self-Supervised Graph Representation Learning for Neuronal Morphologies [75.38832711445421]
We present GraphDINO, a data-driven approach to learn low-dimensional representations of 3D neuronal morphologies from unlabeled datasets.
We show, in two different species and across multiple brain areas, that this method yields morphological cell type clusterings on par with manual feature-based classification by experts.
Our method could potentially enable data-driven discovery of novel morphological features and cell types in large-scale datasets.
arXiv Detail & Related papers (2021-12-23T12:17:47Z) - Formation of cell assemblies with iterative winners-take-all computation and excitation-inhibition balance [0.0]
We present an intermediate model that shares the computational ease of k-winners-take-all (kWTA) and has richer, more flexible dynamics.
We investigate Hebbian-like learning rules and propose a new learning rule for binary weights with multiple stabilization mechanisms.
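For orientation, the iterative kWTA primitive the title refers to can be sketched as below: a toy recurrent update with assumed random weights; the paper's assembly model adds learning rules and excitation-inhibition balance on top.

```python
# Iterative k-winners-take-all until the assembly stabilizes (toy sketch).
import torch

def kwta(x: torch.Tensor, k: int) -> torch.Tensor:
    """Binary vector with exactly k winners: the k largest inputs."""
    y = torch.zeros_like(x)
    y[torch.topk(x, k).indices] = 1.0
    return y

torch.manual_seed(0)
W = torch.rand(20, 20)              # assumed recurrent excitatory weights
y = kwta(torch.rand(20), k=5)       # random initial assembly
for _ in range(10):                  # iterate the winners-take-all update
    y = kwta(W @ y, k=5)
print(y.nonzero().flatten())         # indices of the stable winners
```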
arXiv Detail & Related papers (2021-08-02T08:20:01Z)