SCALE: Upscaled Continual Learning of Large Language Models
- URL: http://arxiv.org/abs/2511.03270v1
- Date: Wed, 05 Nov 2025 08:05:50 GMT
- Title: SCALE: Upscaled Continual Learning of Large Language Models
- Authors: Jin-woo Lee, Junhwa Choi, Bongkyu Hwang, Jinho Choo, Bogun Kim, JeongSeon Yi, Joonseok Lee, DongYoung Jung, Jaeseon Park, Kyoungwon Park, Suk-hoon Jung
- Abstract summary: We introduce a width upscaling architecture that inserts lightweight expansion into linear modules while freezing all pre-trained parameters. This preserves the residual and attention topologies and increases capacity without perturbing the base model's original functionality. Accompanying analysis clarifies when preservation provably holds and why the interplay between preservation and adaptation stabilizes optimization.
- Score: 15.0007740485038
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We revisit continual pre-training for large language models and argue that progress now depends more on scaling the right structure than on scaling parameters alone. We introduce SCALE, a width upscaling architecture that inserts lightweight expansion into linear modules while freezing all pre-trained parameters. This preserves the residual and attention topologies and increases capacity without perturbing the base model's original functionality. SCALE is guided by two principles: Persistent Preservation, which maintains the base model's behavior via preservation-oriented initialization and freezing of the pre-trained weights, and Collaborative Adaptation, which selectively trains a subset of expansion components to acquire new knowledge with minimal interference. We instantiate these ideas as SCALE-Preserve (preservation-first), SCALE-Adapt (adaptation-first), and SCALE-Route, an optional routing extension that performs token-level routing between preservation and adaptation heads. On a controlled synthetic biography benchmark, SCALE mitigates the severe forgetting observed with depth expansion while still acquiring new knowledge. In continual pre-training on a Korean corpus, SCALE variants achieve less forgetting on English evaluations and competitive gains on Korean benchmarks, offering the best overall stability-plasticity trade-off. Accompanying analysis clarifies when preservation provably holds and why the interplay between preservation and adaptation stabilizes optimization compared to standard continual learning setups.
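The expansion mechanism lends itself to a compact sketch. The PyTorch module below is a minimal illustration under stated assumptions, not the authors' released code: the class and parameter names (ScaleLinear, RoutedScaleLinear, expand_dim) are our own, the parallel down/up bottleneck is one plausible form of "lightweight expansion", and the preservation-oriented initialization is realized here by zero-initializing the expansion's output projection, so that the expanded layer computes exactly the frozen base layer's function at step zero.

```python
import torch
import torch.nn as nn

class ScaleLinear(nn.Module):
    """Sketch of a SCALE-style width-expanded linear module.

    The pre-trained layer is frozen; a trainable expansion branch adds
    capacity in parallel. Zero-initializing the expansion's output
    projection makes the module compute exactly the base layer's
    function before training, leaving the residual and attention
    topologies of the host network untouched (Persistent Preservation).
    """

    def __init__(self, base: nn.Linear, expand_dim: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # freeze pre-trained weights
        self.down = nn.Linear(base.in_features, expand_dim, bias=False)
        self.up = nn.Linear(expand_dim, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)               # preservation-oriented init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.up(self.down(x))  # base path + expansion path


class RoutedScaleLinear(nn.Module):
    """Sketch of SCALE-Route-style token-level routing: a learned gate
    mixes, per token, a preservation head (frozen base output) and an
    adaptation head (base output plus trained expansion). Soft sigmoid
    mixing is our stand-in for the paper's token-level routing."""

    def __init__(self, layer: ScaleLinear):
        super().__init__()
        self.layer = layer
        self.gate = nn.Linear(layer.base.in_features, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(x))              # per-token weight in (0, 1)
        preserve = self.layer.base(x)                # preservation head
        adapt = preserve + self.layer.up(self.layer.down(x))  # adaptation head
        return (1.0 - g) * preserve + g * adapt
```

Because `up.weight` starts at zero, the expanded network's forward pass is identical to the frozen base model before any continual-training step; which subsets of `down`, `up`, and `gate` are then trained is what would distinguish the Preserve, Adapt, and Route variants.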
Related papers
- Advancing Analytic Class-Incremental Learning through Vision-Language Calibration [6.871141687303144]
Class-incremental learning (CIL) with pre-trained models (PTMs) faces a critical trade-off between efficient adaptation and long-term stability. We propose VILA, a novel dual-branch framework that advances analytic CIL via a two-level vision-language calibration strategy. Our framework harmonizes high-fidelity prediction with the simplicity of analytic learning.
arXiv Detail & Related papers (2026-02-14T08:32:51Z)
- Steering Vision-Language Pre-trained Models for Incremental Face Presentation Attack Detection [62.89126207012712]
Face Presentation Attack Detection (PAD) demands incremental learning to combat evolving spoofing tactics and domains. Privacy regulations forbid retaining past data, necessitating rehearsal-free incremental learning (RF-IL).
arXiv Detail & Related papers (2025-12-22T04:30:11Z)
- The Curious Case of In-Training Compression of State Space Models [49.819321766705514]
State Space Models (SSMs) tackle long-sequence modeling tasks efficiently, offering both parallelizable training and fast inference. A key design challenge is striking the right balance between maximizing expressivity and limiting the computational burden. Our approach, CompreSSM, applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models.
arXiv Detail & Related papers (2025-10-03T09:02:33Z)
- EKPC: Elastic Knowledge Preservation and Compensation for Class-Incremental Learning [53.88000987041739]
Class-Incremental Learning (CIL) aims to enable AI models to continuously learn from sequentially arriving data of different classes over time. We propose the Elastic Knowledge Preservation and Compensation (EKPC) method, integrating Importance-aware Parameter Regularization (IPR) and Trainable Semantic Drift Compensation (TSDC) for CIL.
arXiv Detail & Related papers (2025-06-14T05:19:58Z)
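The summary names IPR without detail; the underlying pattern, an importance-weighted penalty on parameter drift, is a standard continual-learning device (as in EWC). The sketch below shows that generic pattern rather than EKPC's exact formulation; the function name, the quadratic penalty, and the use of Fisher-style importance estimates are our assumptions.

```python
import torch
import torch.nn as nn

def importance_weighted_penalty(model: nn.Module, anchor: dict,
                                importance: dict, lam: float = 1.0):
    """Generic importance-aware parameter regularization (EWC-style):
    penalize each parameter's drift from its value after the previous
    task, scaled by a per-parameter importance estimate (e.g., squared
    gradients or a Fisher-information diagonal)."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in importance:
            penalty = penalty + (importance[name] * (p - anchor[name]) ** 2).sum()
    return lam * penalty

# Usage: loss = task_loss + importance_weighted_penalty(model, anchor, fisher)
```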
- Sculpting [CLS] Features for Pre-Trained Model-Based Class-Incremental Learning [3.73232466691291]
Class-incremental learning requires models to continually acquire knowledge of new classes without forgetting old ones. Although pre-trained models have demonstrated strong performance in class-incremental learning, they remain susceptible to catastrophic forgetting when learning new concepts. We introduce a new parameter-efficient fine-tuning module 'Learn and Calibrate', or LuCA, designed to acquire knowledge through an adapter-calibrator couple. For each learning session, we deploy a sparse LuCA module on top of the last token, which we refer to as 'Token-level Sparse and Adaptation', or TO
arXiv Detail & Related papers (2025-02-20T17:37:08Z)
- Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization. A self-regularization strategy is further exploited to maintain stability in terms of the zero-shot generalization of VLMs; the combined method is dubbed OrthSR. For the first time, we revisit CLIP and CoOp with our method to effectively improve the models in the few-shot image classification scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z)
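"Orthogonal fine-tuning" is only named in this summary; a common construction (assumed in this sketch, in the style of orthogonal fine-tuning methods such as OFT, not necessarily OrthSR's exact recipe) multiplies a frozen pre-trained weight by a learnable orthogonal matrix obtained from the Cayley transform of a skew-symmetric parameter, so the update rotates neurons while preserving their angular structure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrthogonalFineTune(nn.Module):
    """Sketch: rotate a frozen pre-trained weight by a learned
    orthogonal matrix. With A = P - P^T skew-symmetric, the Cayley
    transform R = (I - A)(I + A)^{-1} is orthogonal, so applying R
    preserves pairwise angles between the pre-trained neurons."""

    def __init__(self, base: nn.Linear):
        super().__init__()
        self.register_buffer("weight", base.weight.detach().clone())
        self.register_buffer("bias", None if base.bias is None
                             else base.bias.detach().clone())
        d = base.out_features
        self.P = nn.Parameter(torch.zeros(d, d))   # A = 0 -> R = I at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        A = self.P - self.P.T                      # skew-symmetric
        I = torch.eye(A.shape[0], device=A.device, dtype=A.dtype)
        R = (I - A) @ torch.linalg.inv(I + A)      # Cayley: orthogonal
        return F.linear(x, R @ self.weight, self.bias)
```

At initialization the rotation is the identity, so, as with the SCALE sketch above, the fine-tuned model starts out exactly reproducing the frozen base layer.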
- Towards Continual Learning Desiderata via HSIC-Bottleneck Orthogonalization and Equiangular Embedding [55.107555305760954]
We propose a conceptually simple yet effective method that attributes forgetting to layer-wise parameter overwriting and the resulting decision boundary distortion.
Our method achieves competitive accuracy while requiring zero exemplar buffer and only 1.02x the size of the base model.
arXiv Detail & Related papers (2024-01-17T09:01:29Z)
- Towards Robust Pruning: An Adaptive Knowledge-Retention Pruning Strategy for Language Models [35.58379464827462]
We introduce a post-training pruning strategy designed to faithfully replicate the embedding space and feature space of dense language models. Compared to other state-of-the-art baselines, our approach demonstrates a superior balance between accuracy, sparsity, robustness, and pruning cost with BERT on the SST2, IMDB, and AGNews datasets.
arXiv Detail & Related papers (2023-10-19T23:02:29Z)
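A minimal, generic sketch of "replicating the dense model's feature space" after pruning follows; the magnitude criterion and MSE feature matching are our assumptions standing in for the paper's specific strategy:

```python
import torch
import torch.nn.functional as F
import torch.nn.utils.prune as prune

def magnitude_prune(model: torch.nn.Module, amount: float = 0.5) -> None:
    """Zero out the smallest-magnitude weights of every Linear layer
    (a simple stand-in for the paper's pruning criterion)."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)

def feature_retention_loss(dense_feats, pruned_feats):
    """Push the pruned model's layer-wise hidden features back toward
    the dense model's, approximating 'replicate the embedding and
    feature space' with an MSE feature-matching objective."""
    return sum(F.mse_loss(p, d.detach())
               for p, d in zip(pruned_feats, dense_feats))
```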
- Adversarial Self-Attention for Language Understanding [89.265747130584]
This paper proposes an Adversarial Self-Attention mechanism (ASA). ASA adversarially reconstructs the Transformer attentions and facilitates model training from contaminated model structures. For fine-tuning, ASA-empowered models consistently outperform naive models by a large margin in both generalization and robustness.
arXiv Detail & Related papers (2022-06-25T09:18:10Z)