Distributed Specialization: Rare-Token Neurons in Large Language Models
- URL: http://arxiv.org/abs/2509.21163v1
- Date: Thu, 25 Sep 2025 13:49:38 GMT
- Title: Distributed Specialization: Rare-Token Neurons in Large Language Models
- Authors: Jing Liu, Haozheng Wang, Yueheng Li
- Abstract summary: Large language models (LLMs) struggle with representing and generating rare tokens despite their importance in specialized domains. We investigate whether LLMs develop internal specialization mechanisms through discrete modular architectures or distributed parameter-level differentiation.
- Score: 8.13000021263958
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) struggle with representing and generating rare tokens despite their importance in specialized domains. We investigate whether LLMs develop internal specialization mechanisms through discrete modular architectures or distributed parameter-level differentiation. Through systematic analysis of final-layer MLP neurons across multiple model families, we discover that rare-token processing emerges via distributed specialization: functionally coordinated but spatially distributed subnetworks that exhibit three distinct organizational principles. First, we identify a reproducible three-regime influence hierarchy comprising highly influential plateau neurons (also termed rare-token neurons), power-law decay neurons, and minimally contributing neurons, which is absent in common-token processing. Second, plateau neurons demonstrate coordinated activation patterns (reduced effective dimensionality) while remaining spatially distributed rather than forming discrete clusters. Third, these specialized mechanisms are universally accessible through standard attention pathways without requiring dedicated routing circuits. Training dynamics reveal that functional specialization emerges gradually through parameter differentiation, with specialized neurons developing increasingly heavy-tailed weight correlation spectra consistent with Heavy-Tailed Self-Regularization signatures. Our findings establish that LLMs process rare tokens through distributed coordination within shared architectures rather than mixture-of-experts-style modularity. These results provide insights for interpretable model editing, computational efficiency optimization, and understanding emergent functional organization in transformer networks.
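The abstract does not walk through the influence analysis itself, so the following is a minimal sketch of the kind of measurement it describes: rank final-layer MLP neurons by a simple proxy for their contribution to a rare token's logit, then inspect the sorted curve for a plateau, a power-law decay, and a negligible tail. The proxy (|activation × output weight|), the synthetic data, and the regime cutoffs are assumptions for illustration, not the paper's exact methodology.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a final-layer MLP with d_mlp neurons.
d_mlp = 4096

# Synthetic stand-ins for quantities one would read off a real model:
#   acts[i]   = mean activation of neuron i on contexts preceding the rare token
#   w_rare[i] = weight from neuron i to the rare token's logit
#               (down-projection composed with the unembedding)
acts = np.abs(rng.normal(size=d_mlp))
w_rare = rng.normal(scale=0.02, size=d_mlp)

# Influence proxy: each neuron's contribution to the rare token's logit.
influence = np.abs(acts * w_rare)

# Sort descending and inspect the three regimes described in the abstract:
# a high plateau, a power-law decay, and a negligible tail.
order = np.argsort(influence)[::-1]
sorted_inf = influence[order]

plateau = sorted_inf[:50]                  # candidate plateau (rare-token) neurons
decay = sorted_inf[50:2000]                # candidate power-law regime

# Crude power-law fit to the decay regime via a log-log linear fit.
ranks = np.arange(51, 2001)
slope, _ = np.polyfit(np.log(ranks), np.log(decay + 1e-12), 1)

print("top-10 neuron ids:", order[:10])
print(f"plateau mean / decay mean: {plateau.mean() / decay.mean():.2f}")
print(f"fitted power-law exponent in decay regime: {slope:.2f}")
```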
Related papers
- Fractional neural attention for efficient multiscale sequence processing [0.0]
We introduce Fractional Neural Attention (FNA), a principled framework for multiscale information processing. FNA models token interactions through Lévy diffusion governed by the fractional Laplacian. FNA achieves competitive text-classification performance even with a single layer and a single head.
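FNA's actual operator is built from the fractional Laplacian; as a loose, self-contained illustration of the heavy-tailed (Lévy-like) token mixing this implies, the toy below blends ordinary dot-product attention with a power-law distance kernel. The blending rule, the mix coefficient, and the exponent are assumptions made for illustration, not the paper's construction.

```python
import numpy as np

def levy_mixed_attention(X, alpha=1.5, mix=0.5):
    """Toy attention whose weights blend softmax scores with a
    heavy-tailed |i - j|^-(1 + alpha) distance kernel (illustrative only)."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)                     # dot-product scores
    scores -= scores.max(axis=1, keepdims=True)
    soft = np.exp(scores)
    soft /= soft.sum(axis=1, keepdims=True)

    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :]) + 1.0  # avoid division by zero
    levy = dist ** -(1.0 + alpha)                     # power-law decay with distance
    levy /= levy.sum(axis=1, keepdims=True)

    weights = (1 - mix) * soft + mix * levy           # blended mixing weights
    return weights @ X

X = np.random.default_rng(1).normal(size=(16, 32))
out = levy_mixed_attention(X)
print(out.shape)  # (16, 32)
```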
arXiv Detail & Related papers (2025-11-13T11:27:39Z)
- Neuronal Group Communication for Efficient Neural representation [85.36421257648294]
This paper addresses the question of how to build large neural systems that learn efficient, modular, and interpretable representations. We propose Neuronal Group Communication (NGC), a theory-driven framework that reimagines a neural network as a dynamical system of interacting neuronal groups. NGC treats weights as transient interactions between embedding-like neuronal states, with neural computation unfolding through iterative communication among groups of neurons.
arXiv Detail & Related papers (2025-10-19T14:23:35Z)
- Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms [55.1784306456972]
Mixture-of-Experts (MoE) architectures have emerged as a promising direction, offering efficiency and scalability by activating only a subset of parameters during inference. We use an internal metric to investigate the mechanisms of MoE architecture by explicitly incorporating routing mechanisms and analyzing expert-level behaviors. We uncover several findings: (1) neuron utilization decreases as models evolve, reflecting stronger generalization; (2) training exhibits a dynamic trajectory, where benchmark performance alone provides limited signal; (3) task completion emerges from collaborative contributions of multiple experts, with shared experts driving concentration; and (4) activation patterns at the neuron level provide a fine-grained proxy for data diversity.
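The internal metric is not spelled out in this summary; a generic proxy for neuron utilization, assumed here purely for illustration, is the average fraction of an expert's neurons that are active per token, which can then be compared across training checkpoints.

```python
import numpy as np

def neuron_utilization(acts, threshold=0.0):
    """Average fraction of neurons active (activation > threshold) per token;
    acts has shape (num_tokens, num_neurons)."""
    return (acts > threshold).mean()

rng = np.random.default_rng(0)
# Toy post-ReLU activations of one expert MLP at two training checkpoints.
early = np.maximum(rng.normal(size=(1024, 2048)), 0.0)           # denser firing
late = np.maximum(rng.normal(loc=-1.0, size=(1024, 2048)), 0.0)  # sparser firing

print(f"early-checkpoint utilization: {neuron_utilization(early):.2f}")
print(f"late-checkpoint utilization:  {neuron_utilization(late):.2f}")
```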
arXiv Detail & Related papers (2025-09-28T15:13:38Z)
- No Clustering, No Routing: How Transformers Actually Process Rare Tokens [6.581088182267414]
Large language models struggle with rare token prediction, yet the mechanisms driving their specialization remain unclear. We investigate this through neuron influence analyses, graph-based clustering, and attention head ablations in GPT-2 XL and Pythia models.
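To make the ablation side of this analysis concrete, here is a small model-agnostic sketch: run a toy multi-head attention layer, zero out one head at a time, and record how the logit of a designated rare token changes. Experiments on GPT-2 XL or Pythia would hook the actual attention modules instead; every name and dimension below is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads, vocab = 8, 64, 4, 1000
d_head = d_model // n_heads
rare_token_id = 973                                   # hypothetical rare token

X = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
Wo = rng.normal(scale=0.1, size=(d_model, d_model))
W_unembed = rng.normal(scale=0.1, size=(d_model, vocab))

def attention(X, ablate_head=None):
    """Single multi-head self-attention layer; optionally zero one head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    head_outputs = []
    for h in range(n_heads):
        sl = slice(h * d_head, (h + 1) * d_head)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
        scores -= scores.max(axis=1, keepdims=True)
        A = np.exp(scores)
        A /= A.sum(axis=1, keepdims=True)
        out = A @ V[:, sl]
        if h == ablate_head:
            out = np.zeros_like(out)                  # ablate this head
        head_outputs.append(out)
    return np.concatenate(head_outputs, axis=1) @ Wo

# Logits at the final position, with and without each head.
baseline = (X + attention(X))[-1] @ W_unembed
for h in range(n_heads):
    logits = (X + attention(X, ablate_head=h))[-1] @ W_unembed
    delta = baseline[rare_token_id] - logits[rare_token_id]
    print(f"head {h}: rare-token logit drop under ablation = {delta:+.4f}")
```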
arXiv Detail & Related papers (2025-08-30T22:20:41Z)
- Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models [68.57424628540907]
Large language models (LLMs) often develop learned mechanisms specialized to specific datasets. We introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms. Our method employs Integrated Gradients to quantify each neuron's influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance.
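Integrated Gradients is a standard attribution method; the sketch below applies it to the hidden activations of a toy two-layer network to score individual neurons, roughly the kind of signal such a pruning criterion could build on. The toy model, the all-zeros baseline, and the step count are assumptions, not the paper's setup.

```python
import torch

torch.manual_seed(0)

d_in, d_hidden, n_classes = 32, 64, 10
layer1 = torch.nn.Linear(d_in, d_hidden)
layer2 = torch.nn.Linear(d_hidden, n_classes)

def head(hidden):
    """Model head applied to the intermediate (neuron) activations."""
    return layer2(torch.relu(hidden))

def integrated_gradients_neurons(x, target, steps=64):
    """IG attribution of the target logit to hidden neurons,
    using an all-zeros activation baseline (an assumption)."""
    hidden = layer1(x).detach()
    baseline = torch.zeros_like(hidden)
    total_grad = torch.zeros_like(hidden)
    for alpha in torch.linspace(0.0, 1.0, steps):
        h = (baseline + alpha * (hidden - baseline)).requires_grad_(True)
        logit = head(h)[target]
        grad, = torch.autograd.grad(logit, h)
        total_grad += grad
    return (hidden - baseline) * total_grad / steps   # per-neuron attribution

x = torch.randn(d_in)
scores = integrated_gradients_neurons(x, target=3)
top = scores.abs().topk(5).indices
print("most influential hidden neurons:", top.tolist())
```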
arXiv Detail & Related papers (2025-07-12T08:10:10Z)
- CodeBrain: Towards Decoupled Interpretability and Multi-Scale Architecture for EEG Foundation Model [52.466542039411515]
EEG foundation models (EFMs) have emerged to address the scalability issues of task-specific models. We present CodeBrain, a two-stage EFM designed to fill this gap. In the first stage, we introduce the TFDual-Tokenizer, which decouples heterogeneous temporal and frequency EEG signals into discrete tokens. In the second stage, we propose the multi-scale EEGSSM architecture, which combines structured global convolution with sliding window attention.
arXiv Detail & Related papers (2025-06-10T17:20:39Z)
- Emergent Specialization: Rare Token Neurons in Language Models [5.946977198458224]
Large language models struggle with representing and generating rare tokens despite their importance in specialized domains. In this study, we identify neuron structures with exceptionally strong influence on a language model's prediction of rare tokens, termed rare token neurons.
arXiv Detail & Related papers (2025-05-19T08:05:13Z)
- Astrocyte-mediated hierarchical modulation enables learning-to-learn in recurrent spiking networks [20.88195975299024]
A central feature of biological intelligence is the ability to learn to learn, enabling rapid adaptation to novel tasks and environments. Inspired by astrocyte-mediated neuromodulation, we propose a hierarchically modulated recurrent spiking neural network (HM-RSNN) that models learning-to-learn. We evaluate HM-RSNN on four cognitive tasks, demonstrating its computational advantages over standard RSNNs and artificial neural networks.
arXiv Detail & Related papers (2025-01-24T14:45:03Z)
- Spatial embedding promotes a specific form of modularity with low entropy and heterogeneous spectral dynamics [0.0]
Spatially embedded recurrent neural networks provide a promising avenue to study how modelled constraints shape the combined structural and functional organisation of networks over learning.
We show that it is possible to study these restrictions through entropic measures of the neural weights and eigenspectrum, across both rate and spiking neural networks.
This work deepens our understanding of constrained learning in neural networks, across coding schemes and tasks, where solutions to simultaneous structural and functional objectives must be accomplished in tandem.
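As a bare-bones version of the two quantities mentioned above, the snippet below computes a histogram-based entropy of a recurrent weight matrix together with its eigen- and singular-value spectra; the binning and the choice of matrix are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=1.0 / np.sqrt(256), size=(256, 256))   # toy recurrent weights

# Entropy of the weight distribution, estimated from a histogram.
counts, _ = np.histogram(W, bins=64)
p = counts / counts.sum()
p = p[p > 0]
weight_entropy = -(p * np.log(p)).sum()

# Eigenspectrum of the recurrent matrix (complex in general) and the
# singular-value spectrum, which heavy-tail analyses often use instead.
eigvals = np.linalg.eigvals(W)
svals = np.linalg.svd(W, compute_uv=False)

print(f"weight-histogram entropy: {weight_entropy:.3f} nats")
print(f"spectral radius: {np.abs(eigvals).max():.3f}")
print(f"largest / smallest singular value: {svals[0]:.3f} / {svals[-1]:.3f}")
```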
arXiv Detail & Related papers (2024-09-26T10:00:05Z)
- Interpretable Spatio-Temporal Embedding for Brain Structural-Effective Network with Ordinary Differential Equation [56.34634121544929]
In this study, we first construct the brain-effective network via the dynamic causal model.
We then introduce an interpretable graph learning framework termed Spatio-Temporal Embedding ODE (STE-ODE)
This framework incorporates specifically designed directed node embedding layers, aiming at capturing the dynamic interplay between structural and effective networks.
arXiv Detail & Related papers (2024-05-21T20:37:07Z)
- Learning Multiscale Consistency for Self-supervised Electron Microscopy Instance Segmentation [48.267001230607306]
We propose a pretraining framework that enhances multiscale consistency in EM volumes.
Our approach leverages a Siamese network architecture, integrating strong and weak data augmentations.
It effectively captures voxel and feature consistency, showing promise for learning transferable representations for EM analysis.
arXiv Detail & Related papers (2023-08-19T05:49:13Z)
- The Expressive Leaky Memory Neuron: an Efficient and Expressive Phenomenological Neuron Model Can Solve Long-Horizon Tasks [64.08042492426992]
We introduce the Expressive Leaky Memory (ELM) neuron model, a biologically inspired model of a cortical neuron.
Our ELM neuron can accurately match such a cortical neuron's input-output relationship with under ten thousand trainable parameters.
We evaluate it on various tasks with demanding temporal structures, including the Long Range Arena (LRA) datasets.
arXiv Detail & Related papers (2023-06-14T13:34:13Z)
- Data-driven emergence of convolutional structure in neural networks [83.4920717252233]
We show how fully-connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs.
By carefully designing data models, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs.
arXiv Detail & Related papers (2022-02-01T17:11:13Z)
- Theory of gating in recurrent neural networks [5.672132510411465]
Recurrent neural networks (RNNs) are powerful dynamical models, widely used in machine learning (ML) and neuroscience.
Here, we show that gating offers flexible control of two salient features of the collective dynamics.
The gate controlling timescales leads to a novel, marginally stable state, where the network functions as a flexible integrator.
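A minimal way to see the timescale-gating effect described here is a rate RNN whose update is convexly gated between the previous state and a new candidate: as the gate closes, the effective integration timescale grows and the network behaves like an integrator. The sketch below is a toy under these assumptions, not the paper's model.

```python
import numpy as np

def gated_rnn_step(h, x, Wr, Wx, Wz, Uz, bz):
    """One step of a rate RNN with a timescale gate z:
    h_new = (1 - z) * h + z * tanh(Wr @ h + Wx @ x).
    Driving z toward 0 lengthens the effective integration timescale."""
    z = 1.0 / (1.0 + np.exp(-(Wz @ h + Uz @ x + bz)))     # sigmoid gate
    return (1.0 - z) * h + z * np.tanh(Wr @ h + Wx @ x)

rng = np.random.default_rng(0)
n, d = 64, 8
Wr = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))
Wx = rng.normal(scale=0.1, size=(n, d))
Wz = rng.normal(scale=0.1, size=(n, n))
Uz = rng.normal(scale=0.1, size=(n, d))

# Compare a nearly closed gate (slow dynamics) with a neutral gate.
for bias in (-4.0, 0.0):
    h = np.zeros(n)
    for _ in range(200):
        h = gated_rnn_step(h, rng.normal(size=d), Wr, Wx, Wz, Uz, bias * np.ones(n))
    print(f"gate bias {bias:+.1f}: final hidden-state norm = {np.linalg.norm(h):.3f}")
```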
arXiv Detail & Related papers (2020-07-29T13:20:58Z)