Progressive Localisation in Localist LLMs
- URL: http://arxiv.org/abs/2511.18375v2
- Date: Fri, 28 Nov 2025 10:44:50 GMT
- Title: Progressive Localisation in Localist LLMs
- Authors: Joachim Diederich
- Abstract summary: This paper demonstrates that progressive localization represents the optimal architecture for creating interpretable large language models (LLMs). We investigate whether interpretability constraints can be aligned with natural semantic structure while being applied strategically across network depth. We show that progressive semantic localization, combining adaptive semantic block partitioning with steep polynomial locality schedules, achieves near-baseline language modeling performance while providing interpretable attention patterns.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper demonstrates that progressive localization, the gradual increase of attention locality from early distributed layers to late localized layers, represents the optimal architecture for creating interpretable large language models (LLMs) while preserving performance. Through systematic experimentation with GPT-2 fine-tuned on The Psychology of Artificial Superintelligence, we evaluate seven locality configurations ranging from fully distributed to strictly localist, with five progressive schedules implementing polynomial increases (linear through quintic). We investigate whether interpretability constraints can be aligned with natural semantic structure while being applied strategically across network depth. We demonstrate that progressive semantic localization, combining adaptive semantic block partitioning with steep polynomial locality schedules, achieves near-baseline language modeling performance while providing interpretable attention patterns. Multiple independent training runs with different random seeds establish that results are statistically robust and highly reproducible. The approach dramatically outperforms both fixed-window localization and naive uniform locality constraints. Analysis reveals that maintaining flexibility through low-fidelity constraints preserves model capacity while providing interpretability benefits, and that steep schedules concentrating locality in decision-critical final layers while preserving distributed learning in early layers achieve near-baseline attention distribution characteristics. These findings demonstrate that interpretability mechanisms should align with semantic structure to achieve practical performance-interpretability tradeoffs for trustworthy AI systems.
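The abstract does not give implementation details, but the core idea of a progressive polynomial locality schedule can be sketched concretely. The following is a minimal illustration, not the paper's actual method: all function names, the window interpolation, and the default parameters are assumptions. It maps each layer's normalized depth through a polynomial (linear through quintic, matching the schedules the paper evaluates) to shrink the attention window from fully global in early layers to tightly local in final layers.

```python
import numpy as np

def locality_window(layer, n_layers, seq_len, degree=3, min_window=8):
    """Hypothetical steep polynomial locality schedule: early layers stay
    fully distributed (window = seq_len) and late layers shrink toward a
    tight local window. `degree` selects linear (1) through quintic (5)."""
    t = layer / (n_layers - 1)      # normalized depth in [0, 1]
    strength = t ** degree          # polynomial increase of locality
    # Interpolate the attention window from global down to min_window.
    return int(round(seq_len - strength * (seq_len - min_window)))

def banded_attention_mask(window, seq_len):
    """Boolean causal attention mask restricted to a band of width `window`:
    position i may attend to j only if j <= i and i - j < window."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (i - j < window)

# Example: a 12-layer model with a 128-token context and a cubic schedule.
windows = [locality_window(l, 12, 128, degree=3) for l in range(12)]
# Early layers keep near-global windows; the final layer is tightly local.
masks = [banded_attention_mask(w, 128) for w in windows]
```

With a steeper degree (e.g. quintic), the window stays near-global for more of the network's depth and collapses only in the decision-critical final layers, which is the tradeoff the abstract argues for.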
Related papers
- StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models [98.72926158261937]
We propose a training-free token pruning framework for Visual AutoRegressive models. We employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. To maintain valid next-scale prediction under sparse tokens, we introduce a nearest neighbor feature propagation strategy.
arXiv Detail & Related papers (2026-03-02T11:35:05Z) - DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training [94.568675548967]
Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain generalization. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. We propose DFPO, a robust distributional RL framework that models values as continuous flows across time steps.
arXiv Detail & Related papers (2026-02-05T17:07:42Z) - Improving LLM Reasoning with Homophily-aware Structural and Semantic Text-Attributed Graph Compression [55.51959317490934]
Large language models (LLMs) have demonstrated promising capabilities in Text-Attributed Graph (TAG) understanding. We argue that graphs inherently contain rich structural and semantic information, and that their effective exploitation can unlock potential gains in LLM reasoning performance. We propose Homophily-aware Structural and Semantic Compression for LLMs (HS2C), a framework centered on exploiting graph homophily.
arXiv Detail & Related papers (2026-01-13T03:35:18Z) - AILA--First Experiments with Localist Language Models [0.0]
This paper presents the first empirical demonstration of controllable locality in transformer language models. We conduct experiments on the WikiText corpus using a two-layer transformer architecture. Prediction experiments reveal that intermediate locality values optimize the tradeoff between interpretability and performance.
arXiv Detail & Related papers (2025-11-05T15:43:54Z) - Localist LLMs with Recruitment Learning [0.0]
We present a novel framework for training large language models with continuously adjustable internal representations. The key innovations are (1) a locality dial that dynamically controls the degree of localization during both training and inference without requiring model retraining, and (2) an information-theoretic recruitment mechanism that adaptively allocates semantic blocks as needed.
arXiv Detail & Related papers (2025-10-20T09:58:34Z) - Token-Level Inference-Time Alignment for Vision-Language Models [58.41370989069588]
Vision-Language Models (VLMs) have become essential backbones of modern multimodal intelligence. We present TITA, a lightweight framework that freezes the base VLM and instead trains a reward model to approximate its distribution. During inference, implicit preference signals are extracted as log-probability ratios between the reward model and the target VLM, yielding dense autoregressive feedback.
arXiv Detail & Related papers (2025-10-20T09:58:03Z) - Localist LLMs -- A Mathematical Framework for Dynamic Locality Control [0.0]
The key innovation is a locality dial, a tunable parameter that dynamically controls the degree of localization during both training and inference without requiring model retraining. We prove that when group sparsity penalties exceed certain threshold values, the model's attention mechanisms concentrate on semantically relevant blocks, achieving low entropy and high fidelity with negligible error.
arXiv Detail & Related papers (2025-10-10T12:44:59Z) - SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models [73.19077622773075]
We present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single-image, multi-view, and video spatial reasoning tasks. We design a three-stage progressive training framework that establishes spatial perception through object localization, develops spatial understanding through multi-dimensional spatial tasks, and strengthens complex reasoning via reinforcement learning with verifiable rewards.
arXiv Detail & Related papers (2025-10-09T17:50:54Z) - PDE Solvers Should Be Local: Fast, Stable Rollouts with Learned Local Stencils [20.49015396991881]
We present FINO, a finite-difference-inspired neural architecture that enforces strict locality. FINO replaces fixed finite-difference stencil coefficients with learnable convolutional kernels. It achieves up to 44% lower error and up to around 2x speedups over state-of-the-art operator-learning baselines.
arXiv Detail & Related papers (2025-09-30T12:42:32Z) - Boosting Neural Language Inference via Cascaded Interactive Reasoning [38.125341836302525]
Natural Language Inference (NLI) focuses on ascertaining the logical relationship between a given premise and hypothesis. This task presents significant challenges due to inherent linguistic features such as diverse phrasing, semantic complexity, and contextual nuances. We introduce the Cascaded Interactive Reasoning Network (CIRN), a novel architecture designed for deeper semantic comprehension in NLI.
arXiv Detail & Related papers (2025-05-10T11:37:15Z) - Stochastic Layer-wise Learning: Scalable and Efficient Alternative to Backpropagation [1.0285749562751982]
Backpropagation underpins modern deep learning, yet its reliance on global synchronization limits scalability and incurs high memory costs. In contrast, fully local learning rules are more efficient but often struggle to maintain the cross-layer coordination needed for coherent global learning. We introduce Stochastic Layer-wise Learning (SLL), a layer-wise training algorithm that decomposes the global objective into coordinated layer-local updates.
arXiv Detail & Related papers (2025-05-08T12:32:29Z) - The Remarkable Robustness of LLMs: Stages of Inference? [5.346230590800585]
We investigate the robustness of Large Language Models (LLMs) to structural interventions by deleting and swapping adjacent layers during inference. Surprisingly, models retain 72-95% of their original top-1 prediction accuracy without any fine-tuning.
arXiv Detail & Related papers (2024-06-27T17:57:03Z) - Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates [71.81037644563217]
Synchronous federated learning (FL) is a popular paradigm for collaborative edge learning.
As some of the devices may have limited computational resources and varying availability, FL latency is highly sensitive to stragglers.
We propose straggler-aware layer-wise federated learning (SALF) that leverages the optimization procedure of NNs via backpropagation to update the global model in a layer-wise fashion.
arXiv Detail & Related papers (2024-03-27T09:14:36Z) - Adaptive Global-Local Representation Learning and Selection for Cross-Domain Facial Expression Recognition [54.334773598942775]
Domain shift poses a significant challenge in Cross-Domain Facial Expression Recognition (CD-FER).
We propose an Adaptive Global-Local Representation Learning and Selection framework.
arXiv Detail & Related papers (2024-01-20T02:21:41Z) - Understanding How Consistency Works in Federated Learning via Stage-wise
Relaxed Initialization [84.42306265220274]
Federated learning (FL) is a distributed paradigm that coordinates massive local clients to collaboratively train a global model.
Previous works have implicitly studied that FL suffers from the "client-drift" problem, which is caused by inconsistent optima across local clients.
To alleviate the negative impact of "client drift" and explore its substance in FL, we first design an efficient FL algorithm, FedInit.
arXiv Detail & Related papers (2023-06-09T06:55:15Z) - Manifold-Aware Self-Training for Unsupervised Domain Adaptation on
Regressing 6D Object Pose [69.14556386954325]
Domain gap between synthetic and real data in visual regression is bridged in this paper via global feature alignment and local refinement.
Our method incorporates an explicit self-supervised manifold regularization, revealing consistent cumulative target dependency across domains.
We learn unified implicit neural functions to estimate the relative direction and distance of targets to their nearest class bins, refining target classification predictions.
arXiv Detail & Related papers (2023-05-18T08:42:41Z) - Delving into Sequential Patches for Deepfake Detection [64.19468088546743]
Recent advances in face forgery techniques produce nearly untraceable deepfake videos, which could be leveraged with malicious intentions.
Previous studies have identified the importance of local low-level cues and temporal information for generalizing well across deepfake methods.
We propose the Local- & Temporal-aware Transformer-based Deepfake Detection framework, which adopts a local-to-global learning protocol.
arXiv Detail & Related papers (2022-07-06T16:46:30Z) - Edge-assisted Democratized Learning Towards Federated Analytics [67.44078999945722]
We show the hierarchical learning structure of the proposed edge-assisted democratized learning mechanism, namely Edge-DemLearn.
We also validate Edge-DemLearn as a flexible model training mechanism to build a distributed control and aggregation methodology in regions.
arXiv Detail & Related papers (2020-12-01T11:46:03Z) - Second-Order Guarantees in Centralized, Federated and Decentralized Nonconvex Optimization [64.26238893241322]
Simple algorithms have been shown to lead to good empirical results in many contexts.
Several works have pursued rigorous analytical justification for studying nonconvex optimization problems.
A key insight in these analyses is that perturbations play a critical role in allowing local descent algorithms to escape saddle points.
arXiv Detail & Related papers (2020-03-31T16:54:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.