One-Prompt Strikes Back: Sparse Mixture of Experts for Prompt-based Continual Learning
- URL: http://arxiv.org/abs/2509.24483v1
- Date: Mon, 29 Sep 2025 08:54:58 GMT
- Title: One-Prompt Strikes Back: Sparse Mixture of Experts for Prompt-based Continual Learning
- Authors: Minh Le, Bao-Ngoc Dao, Huy Nguyen, Quyen Tran, Anh Nguyen, Nhat Ho
- Abstract summary: We propose SMoPE, a novel framework that integrates the benefits of both task-specific and shared prompt strategies. SMoPE consistently outperforms task-specific prompt methods and achieves performance competitive with state-of-the-art approaches.
- Score: 52.966712416640085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt-based methods have recently gained prominence in Continual Learning (CL) due to their strong performance and memory efficiency. A prevalent strategy in this paradigm assigns a dedicated subset of prompts to each task, which, while effective, incurs substantial computational overhead and causes memory requirements to scale linearly with the number of tasks. Conversely, approaches employing a single shared prompt across tasks offer greater efficiency but often suffer from degraded performance due to knowledge interference. To reconcile this trade-off, we propose SMoPE, a novel framework that integrates the benefits of both task-specific and shared prompt strategies. Inspired by recent findings on the relationship between Prefix Tuning and Mixture of Experts (MoE), SMoPE organizes a shared prompt into multiple "prompt experts" within a sparse MoE architecture. For each input, only a select subset of relevant experts is activated, effectively mitigating interference. To facilitate expert selection, we introduce a prompt-attention score aggregation mechanism that computes a unified proxy score for each expert, enabling dynamic and sparse activation. Additionally, we propose an adaptive noise mechanism to encourage balanced expert utilization while preserving knowledge from prior tasks. To further enhance expert specialization, we design a prototype-based loss function that leverages prefix keys as implicit memory representations. Extensive experiments across multiple CL benchmarks demonstrate that SMoPE consistently outperforms task-specific prompt methods and achieves performance competitive with state-of-the-art approaches, all while significantly reducing parameter counts and computational costs.
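The routing idea described in the abstract lends itself to a compact sketch. The snippet below is a minimal illustration (not the authors' released code) of the core mechanism: a shared prompt split into prompt experts, a per-expert proxy score aggregated from the attention between an input query and each expert's prefix keys, optional noise on the scores to encourage balanced utilization, and sparse top-k activation of the selected prefixes. All shapes, the averaging used for score aggregation, and the Gaussian form of the noise are illustrative assumptions rather than details taken from the paper.

```python
# Hedged sketch of sparse prompt-expert routing in the spirit of SMoPE.
# Assumed: pooled query feature per input, mean-aggregated attention as the
# proxy score, Gaussian noise as a stand-in for the adaptive noise mechanism.
import torch
import torch.nn.functional as F


class SparsePromptExperts(torch.nn.Module):
    def __init__(self, num_experts=8, top_k=2, prefix_len=4, dim=768):
        super().__init__()
        # Each expert owns a small prefix: keys and values of shape (prefix_len, dim).
        self.prefix_keys = torch.nn.Parameter(torch.randn(num_experts, prefix_len, dim) * 0.02)
        self.prefix_values = torch.nn.Parameter(torch.randn(num_experts, prefix_len, dim) * 0.02)
        self.top_k = top_k

    def forward(self, query, noise_scale=0.0):
        # query: (batch, dim) pooled feature used to route the input.
        # Proxy score per expert: scaled dot-product attention of the query over
        # that expert's prefix keys, averaged into one scalar per expert.
        attn = torch.einsum('bd,epd->bep', query, self.prefix_keys) / query.size(-1) ** 0.5
        scores = attn.mean(dim=-1)                                   # (batch, num_experts)
        if self.training and noise_scale > 0:
            # Noise on the scores to encourage balanced expert utilization.
            scores = scores + noise_scale * torch.randn_like(scores)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)        # (batch, k)
        gates = F.softmax(top_scores, dim=-1)                        # (batch, k)
        # Gather the selected experts' prefixes and mix them with the gates.
        sel_keys = self.prefix_keys[top_idx]                         # (batch, k, prefix_len, dim)
        sel_vals = self.prefix_values[top_idx]
        prefix_k = (gates[..., None, None] * sel_keys).sum(dim=1)    # (batch, prefix_len, dim)
        prefix_v = (gates[..., None, None] * sel_vals).sum(dim=1)
        return prefix_k, prefix_v, top_idx


if __name__ == "__main__":
    router = SparsePromptExperts()
    prefix_k, prefix_v, chosen = router(torch.randn(2, 768), noise_scale=0.1)
    print(prefix_k.shape, chosen.shape)  # torch.Size([2, 4, 768]) torch.Size([2, 2])
```

In an actual prefix-tuning setup, the returned prefix keys and values would be prepended to the attention keys and values of the frozen backbone; the prototype-based loss over prefix keys mentioned in the abstract is omitted from this sketch.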
Related papers
- Is Parameter Isolation Better for Prompt-Based Continual Learning? [46.254917907419895]
Most existing methods assign a fixed set of prompts to each task, isolating knowledge across tasks and resulting in suboptimal parameter utilization. This framework constructs a global prompt pool and introduces a task-aware gated routing mechanism that sparsely activates a subset of prompts. We also introduce a history-aware modulator that leverages cumulative prompt activation statistics to protect frequently used prompts from excessive updates.
arXiv Detail & Related papers (2026-01-28T08:17:11Z) - Split-on-Share: Mixture of Sparse Experts for Task-Agnostic Continual Learning [10.01449025634975]
Continual learning in Large Language Models (LLMs) is hindered by the plasticity-stability dilemma. We introduce SETA, a framework that resolves the plasticity-stability conflict by decomposing the model into modular subspaces. We show that SETA consistently outperforms state-of-the-art parameter-efficient fine-tuning-based continual learning methods.
arXiv Detail & Related papers (2026-01-24T22:39:22Z) - Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder [59.89996751196727]
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models. SAEs' hidden layers have high dimensionality to satisfy sparsity constraints, resulting in prohibitive training and inference costs. Recent Mixture of Experts (MoE) approaches attempt to address this by decomposing SAEs into narrower expert networks with gated activation. We propose two key innovations: (1) Multiple Expert Activation, which simultaneously engages semantically weighted expert subsets to encourage specialization, and (2) Feature Scaling, which enhances diversity through adaptive high-frequency scaling.
arXiv Detail & Related papers (2025-11-07T22:19:34Z) - GMoPE:A Prompt-Expert Mixture Framework for Graph Foundation Models [30.023472202549076]
Graph Neural Networks (GNNs) have demonstrated impressive performance on task-specific benchmarks, yet their ability to generalize across diverse domains and tasks remains limited. We propose GMoPE, a framework that seamlessly integrates the Mixture-of-Experts (MoE) architecture with prompt-based learning for graphs. We show that GMoPE consistently outperforms state-of-the-art baselines and achieves performance comparable to full parameter fine-tuning.
arXiv Detail & Related papers (2025-11-05T07:28:51Z) - Multi-Task Dense Prediction Fine-Tuning with Mixture of Fine-Grained Experts [22.936728143586443]
Multi-task learning (MTL) for dense prediction has shown promising results but still faces challenges in balancing shared representations with task-specific specialization. We introduce a novel Fine-Grained Mixture of Experts architecture that explores MoE-based MTL models through a combination of three key innovations and fine-tuning.
arXiv Detail & Related papers (2025-07-25T08:59:30Z) - Adaptive Prompting for Continual Relation Extraction: A Within-Task Variance Perspective [23.79259400522239]
We propose a novel approach to address catastrophic forgetting in Continual Relation Extraction. Our approach employs a prompt pool for each task, capturing variations within each task while enhancing cross-task variances.
arXiv Detail & Related papers (2024-12-11T11:00:33Z) - Complexity Experts are Task-Discriminative Learners for Any Image Restoration [80.46313715427928]
We introduce "complexity experts" -- flexible expert blocks with varying computational complexity and receptive fields. This preference effectively drives task-specific allocation, assigning tasks to experts with the appropriate complexity. The proposed MoCE-IR model outperforms state-of-the-art methods, affirming its efficiency and practical applicability.
arXiv Detail & Related papers (2024-11-27T15:58:07Z) - Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs). Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss. For instance, at 2.54 bits, MC compresses 76.6% of the model, with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z) - FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves performance comparable to the source model, retaining up to 85% of model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z) - PECTP: Parameter-Efficient Cross-Task Prompts for Incremental Vision Transformer [76.39111896665585]
Incremental Learning (IL) aims to learn deep models on sequential tasks continually.
Recent vast pre-trained models (PTMs) have achieved outstanding performance via prompt techniques in practical IL without access to old samples.
arXiv Detail & Related papers (2024-07-04T10:37:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the listed information and is not responsible for any consequences of its use.