Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns
- URL: http://arxiv.org/abs/2602.22479v1
- Date: Wed, 25 Feb 2026 23:38:16 GMT
- Title: Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns
- Authors: Afshin Khadangi
- Abstract summary: We introduce TRC$^{2}$ (Thalamically Routed Cortical Columns), a decoder-only backbone that addresses continual learning at the architectural level. The resulting block is sparse and chunk-parallel, enabling efficient training and inference while preserving clean ablations of each subsystem.
- Score: 0.16921396880325779
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continual learning is a core requirement for deployed language models, yet standard training and fine-tuning pipelines remain brittle under non-stationary data. Online updates often induce catastrophic forgetting, while methods that improve stability frequently increase latency, memory footprint, or dense computation in ways that do not scale well to long contexts. We introduce TRC$^{2}$ (Thalamically Routed Cortical Columns), a decoder-only backbone that addresses continual learning at the architectural level. TRC$^{2}$ combines sparse thalamic routing over cortical columns with mechanisms for modulation, prediction, memory, and feedback, together with a fast corrective pathway that supports rapid adaptation without destabilizing slower parameters. The resulting block is sparse and chunk-parallel, enabling efficient training and inference while preserving clean ablations of each subsystem. We instantiate a reproducible training and evaluation stack and a continual-learning harness that measures proxy forgetting under streaming domain shifts. Across language modeling and continual learning benchmarks, TRC$^{2}$ improves the stability-plasticity tradeoff at comparable compute, enabling rapid on-stream adaptation while preserving previously acquired behavior.
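As a rough illustration of the routing component only, "sparse thalamic routing over cortical columns" could be read as a top-k gate dispatching tokens to column feed-forward subnetworks. This is a minimal sketch under that mixture-of-experts-style reading; all names, shapes, and the top-k rule are assumptions, and the paper's modulation, prediction, memory, feedback, and fast corrective pathways are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThalamicRouter(nn.Module):
    """Hypothetical sketch: a gate ('thalamus') routes each token to a
    sparse top-k subset of 'cortical column' feed-forward subnetworks."""
    def __init__(self, d_model: int, n_columns: int = 8, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_columns)
        self.columns = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_columns)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        scores = self.gate(x)                       # (B, S, n_columns)
        top = scores.topk(self.k, dim=-1)
        weights = F.softmax(top.values, dim=-1)     # renormalize over the k winners
        out = torch.zeros_like(x)
        # Dense loop for clarity; a real chunk-parallel kernel would
        # dispatch only the tokens actually routed to each column.
        for c, column in enumerate(self.columns):
            hit = (top.indices == c)                # (B, S, k) routing mask
            if hit.any():
                w = (weights * hit).sum(-1, keepdim=True)  # (B, S, 1)
                out = out + w * column(x)
        return out
```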
Related papers
- Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes [10.877713536966601]
The Longest Stable Prefix (LSP) scheduler is a training-free, model-agnostic inference paradigm based on monolithic prefix absorption. LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural acceptance points before an atomic commitment (see the sketch after this list).
arXiv Detail & Related papers (2026-03-05T18:25:26Z) - When Learning Hurts: Fixed-Pole RNN for Real-Time Online Training [58.25341036646294]
We analytically examine why learning recurrent poles does not provide tangible benefits, and we confirm this empirically in real-time learning scenarios. We show that fixed-pole networks achieve superior performance with lower training complexity, making them more suitable for online real-time tasks (see the sketch after this list).
arXiv Detail & Related papers (2026-02-25T00:15:13Z) - Pretraining with Token-Level Adaptive Latent Chain-of-Thought [44.19871205975474]
Scaling large language models by increasing parameters and training data is increasingly constrained by limited high-quality corpora and rising communication costs. This work explores an alternative axis: increasing per-token computation without expanding parameters, by internalizing latent Chain-of-Thought (CoT) into pretraining. We propose Pretraining with Token-Level Adaptive Latent CoT, where the model generates a variable-length latent CoT trajectory before emitting each token. Experiments with Llama architectures show that adaptive latent CoT consistently improves language modeling perplexity and broad downstream accuracy, even with fewer training FLOPs.
arXiv Detail & Related papers (2026-02-09T02:49:15Z) - Trust Region Continual Learning as an Implicit Meta-Learner [3.705371747297478]
We study a hybrid perspective, *trust region continual learning*, which combines generative replay with a Fisher-metric trust-region constraint. We show that, under local approximations, the resulting update admits a MAML-style interpretation with a single implicit inner step, yielding an emergent meta-learning property in continual learning (see the sketch after this list).
arXiv Detail & Related papers (2026-02-02T18:19:16Z) - FOREVER: Forgetting Curve-Inspired Memory Replay for Language Model Continual Learning [63.20028888397869]
FOREVER (FORgEtting curVe-inspired mEmory Replay) is a novel framework that aligns replay schedules with a model-centric notion of time. Building on this approach, FOREVER incorporates a forgetting curve-based replay scheduler to determine when to replay, and an intensity-aware regularization mechanism to adaptively control how to replay (see the sketch after this list).
arXiv Detail & Related papers (2026-01-07T13:55:14Z) - Entropy-Guided Token Dropout: Training Autoregressive Language Models with Limited Domain Data [89.96277093034547]
We introduce EntroDrop, an entropy-guided token dropout method that functions as structured data regularization. We show that EntroDrop consistently outperforms standard regularization baselines and maintains robust performance throughout extended multi-epoch training (see the sketch after this list).
arXiv Detail & Related papers (2025-12-29T12:35:51Z) - TNT: Improving Chunkwise Training for Test-Time Memorization [62.78875147721906]
Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly-scaling paradigm distinct from Transformers. We introduce TNT, a novel training paradigm that decouples training efficiency from inference performance through a two-stage process. TNT achieves a substantial acceleration in training speed, up to 17 times faster than the most accurate baseline configuration.
arXiv Detail & Related papers (2025-11-10T17:45:09Z) - SAC Flow: Sample-Efficient Reinforcement Learning of Flow-Based Policies via Velocity-Reparameterized Sequential Modeling [9.936731043466699]
Training expressive flow-based policies with off-policy reinforcement learning is notoriously unstable due to gradient pathologies in the multi-step action sampling process. We trace this instability to a fundamental connection: the flow rollout is algebraically equivalent to a residual recurrent computation, making it susceptible to the same vanishing and exploding gradients as RNNs. We develop a practical SAC-based algorithm, enabled by a noise-augmented rollout, that facilitates direct end-to-end training of these policies.
arXiv Detail & Related papers (2025-09-30T04:21:20Z) - TrajBooster: Boosting Humanoid Whole-Body Manipulation via Trajectory-Centric Learning [79.59753528758361]
We present TrajBooster, a cross-embodiment framework that leverages abundant wheeled-humanoid data to boost bipedal VLA. Our key idea is to use end-effector trajectories as a morphology-agnostic interface. Results show that TrajBooster allows existing wheeled-humanoid data to efficiently strengthen bipedal humanoid VLA performance.
arXiv Detail & Related papers (2025-09-15T12:25:39Z) - Large Continual Instruction Assistant [59.585544987096974]
Continual Instruction Tuning (CIT) is adopted to instruct large models to follow human intent, dataset by dataset. Existing gradient updates heavily degrade performance on previous datasets during the CIT process. We propose a general continual instruction tuning framework to address this challenge.
arXiv Detail & Related papers (2024-10-08T11:24:59Z) - Transformers for Supervised Online Continual Learning [11.270594318662233]
We propose a method that leverages transformers' in-context learning capabilities for online continual learning.
Our method demonstrates significant improvements over previous state-of-the-art results on CLOC, a challenging large-scale real-world benchmark for image geo-localization.
arXiv Detail & Related papers (2024-03-03T16:12:20Z) - Towards Continual Learning Desiderata via HSIC-Bottleneck Orthogonalization and Equiangular Embedding [55.107555305760954]
We propose a conceptually simple yet effective method that attributes forgetting to layer-wise parameter overwriting and the resulting decision boundary distortion.
Our method achieves competitive accuracy while using no exemplar buffer and only 1.02x the size of the base model.
arXiv Detail & Related papers (2024-01-17T09:01:29Z)
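For the LSP-style prefix commitment referenced above, here is a minimal sketch of a longest-stable-prefix rule. The stability scores, threshold, and delimiter set are illustrative assumptions, not details taken from the paper.

```python
import torch

def longest_stable_prefix(stability: torch.Tensor, tokens: list[str],
                          threshold: float = 0.9,
                          delimiters: frozenset[str] = frozenset({".", ",", " "})) -> int:
    """Hypothetical sketch: take the longest left-aligned run of positions
    whose stability score clears `threshold`, then snap the cut back to the
    last delimiter so the committed prefix ends on a natural boundary."""
    stable = stability >= threshold          # (seq,) bool
    # Length of the contiguous stable run starting at position 0:
    # cumprod stays 1 until the first unstable position.
    run = int(torch.cumprod(stable.int(), dim=0).sum().item())
    # Snap the boundary back to the last delimiter inside the run.
    for i in range(run - 1, -1, -1):
        if tokens[i] in delimiters:
            return i + 1
    return run  # no delimiter found: commit the whole stable run
```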
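One plausible reading of the fixed-pole RNN (When Learning Hurts) is a diagonal linear recurrence whose decay coefficients (poles) are frozen while only the input and output projections train. A hypothetical sketch under that assumption:

```python
import torch
import torch.nn as nn

class FixedPoleRNN(nn.Module):
    """Hypothetical sketch: h_t = a * h_{t-1} + W x_t with the poles `a`
    registered as a buffer (not a parameter), so only the input/output
    maps receive gradients."""
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        # Fixed stable poles sampled in (0, 1); never updated by training.
        self.register_buffer("poles", torch.rand(d_hidden) * 0.98 + 0.01)
        self.w_in = nn.Linear(d_in, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_in)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_in)
        h = torch.zeros(x.shape[0], self.poles.shape[0], device=x.device)
        u = self.w_in(x)
        outs = []
        for t in range(x.shape[1]):
            h = self.poles * h + u[:, t]
            outs.append(self.w_out(h))
        return torch.stack(outs, dim=1)
```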
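The Fisher-metric trust region in Trust Region Continual Learning can be approximated, in the common diagonal-Fisher setting, by an EWC-style quadratic penalty around parameters learned on earlier tasks; the paper's exact constraint may differ. A minimal sketch:

```python
import torch

def fisher_trust_region_loss(model: torch.nn.Module,
                             task_loss: torch.Tensor,
                             fisher_diag: dict[str, torch.Tensor],
                             anchor: dict[str, torch.Tensor],
                             lam: float = 1.0) -> torch.Tensor:
    """Hypothetical sketch: task loss plus a diagonal-Fisher quadratic
    penalty keeping updates inside a trust region around the parameters
    learned on earlier tasks (an EWC-style surrogate)."""
    penalty = torch.zeros((), device=task_loss.device)
    for name, p in model.named_parameters():
        penalty = penalty + (fisher_diag[name] * (p - anchor[name]) ** 2).sum()
    return task_loss + lam * penalty
```

Combined with generative replay on the data side, such a penalty bounds how far each update can move in the (approximate) Fisher metric.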
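FOREVER's forgetting-curve scheduling might resemble an Ebbinghaus-style retention rule: replay an example once its predicted retention exp(-t/s) drops below a threshold, and grow the stability s after each replay. The curve, threshold, and growth factor below are assumptions for illustration.

```python
import math

class ForgettingCurveScheduler:
    """Hypothetical sketch: replay an example when predicted retention
    R = exp(-t / s) falls below a threshold, then lengthen the interval
    by growing s (spaced repetition). Time t is measured in model steps."""
    def __init__(self, threshold: float = 0.5, growth: float = 2.0):
        self.threshold, self.growth = threshold, growth
        self.stability: dict[int, float] = {}  # example id -> s
        self.last_seen: dict[int, int] = {}    # example id -> step

    def due(self, example_id: int, step: int) -> bool:
        s = self.stability.setdefault(example_id, 1.0)
        t = step - self.last_seen.setdefault(example_id, step)
        return math.exp(-t / s) < self.threshold

    def replayed(self, example_id: int, step: int) -> None:
        self.stability[example_id] *= self.growth
        self.last_seen[example_id] = step
```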
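EntroDrop's entropy-guided dropout might be approximated by masking the lowest-entropy target tokens out of the language modeling loss during repeated epochs, on the assumption that near-memorized tokens drive overfitting; the paper's actual criterion and rate may differ.

```python
import torch
import torch.nn.functional as F

def entropy_guided_token_mask(logits: torch.Tensor,
                              drop_rate: float = 0.15) -> torch.Tensor:
    """Hypothetical sketch: score each position by predictive entropy and
    drop the lowest-entropy tokens from the loss. Returns a (B, S) bool
    mask where True means 'keep this token in the LM loss'."""
    probs = F.softmax(logits, dim=-1)                               # (B, S, V)
    entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum(-1)   # (B, S)
    k = max(1, int(drop_rate * entropy.shape[-1]))
    cutoff = entropy.kthvalue(k, dim=-1, keepdim=True).values       # (B, 1)
    return entropy > cutoff
```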