Mixture of Experts Meets Prompt-Based Continual Learning
- URL: http://arxiv.org/abs/2405.14124v4
- Date: Sat, 04 Jan 2025 04:04:50 GMT
- Title: Mixture of Experts Meets Prompt-Based Continual Learning
- Authors: Minh Le, An Nguyen, Huy Nguyen, Trang Nguyen, Trang Pham, Linh Van Ngo, Nhat Ho
- Abstract summary: This paper conducts a theoretical analysis to unravel how prompts bestow such advantages in continual learning.
We provide a novel view on prefix tuning, reframing it as the addition of new task-specific experts, thereby inspiring the design of a novel gating mechanism.
The effectiveness of NoRGa is substantiated both theoretically and empirically across diverse benchmarks and pretraining paradigms.
- Score: 23.376460019465235
- Abstract: Exploiting the power of pre-trained models, prompt-based approaches stand out compared to other continual learning solutions in effectively preventing catastrophic forgetting, even with very few learnable parameters and without the need for a memory buffer. While existing prompt-based continual learning methods excel in leveraging prompts for state-of-the-art performance, they often lack a theoretical explanation for the effectiveness of prompting. This paper conducts a theoretical analysis to unravel how prompts bestow such advantages in continual learning, thus offering a new perspective on prompt design. We first show that the attention block of pre-trained models like Vision Transformers inherently encodes a special mixture of experts architecture, characterized by linear experts and quadratic gating score functions. This realization drives us to provide a novel view on prefix tuning, reframing it as the addition of new task-specific experts, thereby inspiring the design of a novel gating mechanism termed Non-linear Residual Gates (NoRGa). Through the incorporation of non-linear activation and residual connection, NoRGa enhances continual learning performance while preserving parameter efficiency. The effectiveness of NoRGa is substantiated both theoretically and empirically across diverse benchmarks and pretraining paradigms. Our code is publicly available at https://github.com/Minhchuyentoancbn/MoE_PromptCL
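To make the mixture-of-experts reading concrete, the sketch below treats the pre-softmax attention logits as gating scores over value "experts", adds learnable prefix keys and values as new task-specific experts, and passes the prefix logits through a non-linear gate with a residual connection. This is a minimal illustration under stated assumptions: the class name and the tanh gate are placeholders, not the paper's exact NoRGa formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoRGaStylePrefixAttention(nn.Module):
    """Single-head attention with prefix tuning. Pre-softmax logits act as
    gating scores over value "experts"; the prefix adds new task-specific
    experts, and its logits pass through a non-linear gate with a residual
    connection. The tanh form is an assumption, not the exact NoRGa gate."""

    def __init__(self, dim, prefix_len):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Learnable task-specific prefix keys/values: the "new experts".
        self.prefix_k = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)

    def forward(self, x):                                   # x: (batch, seq, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scale = q.shape[-1] ** -0.5
        logits_x = q @ k.transpose(-2, -1) * scale          # gates over input experts
        logits_p = q @ self.prefix_k.t() * scale            # gates over prefix experts
        logits_p = logits_p + torch.tanh(logits_p)          # non-linear residual gate
        attn = F.softmax(torch.cat([logits_x, logits_p], dim=-1), dim=-1)
        values = torch.cat([v, self.prefix_v.expand(x.size(0), -1, -1)], dim=1)
        return attn @ values
```

In this view, only the prefix parameters are trained for a new task, so new experts are added while the pre-trained ones stay frozen.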
Related papers
- Revisiting Prefix-tuning: Statistical Benefits of Reparameterization among Prompts [36.88984387787463]
We study the theoretical foundations of prompt-based techniques for fine-tuning large pre-trained models.
Our study demonstrates that reparameterization is not merely an engineering trick but is grounded in deep theoretical foundations.
Our findings provide theoretical and empirical contributions, advancing the understanding of prompt-based methods and their underlying mechanisms.
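As a point of reference for the reparameterization being analyzed, the sketch below generates prefix key/value vectors from a small shared embedding through an MLP instead of optimizing them directly. The bottleneck size and the two-layer MLP are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ReparameterizedPrefix(nn.Module):
    """Prefixes are produced from a small shared embedding through an MLP,
    rather than being optimized directly."""

    def __init__(self, prefix_len, dim, bottleneck=64):
        super().__init__()
        self.embed = nn.Parameter(torch.randn(prefix_len, bottleneck) * 0.02)
        self.mlp = nn.Sequential(
            nn.Linear(bottleneck, dim),
            nn.Tanh(),
            nn.Linear(dim, 2 * dim),   # concatenated prefix keys and values
        )

    def forward(self):
        out = self.mlp(self.embed)                 # (prefix_len, 2 * dim)
        prefix_k, prefix_v = out.chunk(2, dim=-1)  # each (prefix_len, dim)
        return prefix_k, prefix_v
```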
arXiv Detail & Related papers (2024-10-03T04:30:24Z) - Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach [87.8330887605381]
We show how to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters.
We synthesize a task-specific query with a learnable and lightweight module, which is independent of the pre-trained model.
Our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.
arXiv Detail & Related papers (2024-07-09T15:45:04Z) - Visual Prompt Tuning in Null Space for Continual Learning [51.96411454304625]
Existing prompt-tuning methods have demonstrated impressive performance in continual learning (CL).
This paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features.
In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient projection.
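A minimal sketch of the gradient-projection idea, assuming the null space is approximated from an eigendecomposition of the previous tasks' feature covariance; the helper names and the eigenvalue threshold are illustrative and not necessarily the paper's exact algorithm.

```python
import torch

def null_space_projector(prev_features, eps=1e-5):
    """Build a projector onto the (approximate) null space of previous-task
    features. prev_features: (n_samples, dim); eigenvectors with near-zero
    eigenvalues span the null space."""
    cov = prev_features.t() @ prev_features / prev_features.shape[0]
    eigvals, eigvecs = torch.linalg.eigh(cov)
    null_basis = eigvecs[:, eigvals < eps]          # (dim, k)
    return null_basis @ null_basis.t()              # (dim, dim)

def project_prompt_grad(prompt, projector):
    """Project the prompt's gradient so updates stay orthogonal to the
    subspace spanned by previous tasks' features."""
    if prompt.grad is not None:
        prompt.grad.copy_(prompt.grad @ projector)
```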
arXiv Detail & Related papers (2024-06-09T05:57:40Z) - LSPT: Long-term Spatial Prompt Tuning for Visual Representation Learning [36.843950725332476]
Visual Prompt Tuning (VPT) techniques adapt pre-trained Vision Transformers (ViTs) to downstream visual tasks using specialized learnable tokens termed prompts.
We introduce Long-term Spatial Prompt Tuning (LSPT) - a revolutionary approach to visual representation learning.
Our empirical findings underscore the superiority of LSPT, showcasing its ability to set new benchmarks in visual prompt tuning performance.
arXiv Detail & Related papers (2024-02-27T10:55:07Z) - Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality [55.88910947643436]
Self-supervised pre-training is essential for handling vast quantities of unlabeled data in practice.
HiDe-Prompt is an innovative approach that explicitly optimizes the hierarchical components with an ensemble of task-specific prompts and statistics.
Our experiments demonstrate the superior performance of HiDe-Prompt and its robustness to pre-training paradigms in continual learning.
arXiv Detail & Related papers (2023-10-11T06:51:46Z) - Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications [63.29358103217275]
Compressing Large Language Models (LLMs) often leads to reduced performance, especially for knowledge-intensive tasks.
We propose two conjectures on the nature of the damage: one is that certain knowledge is forgotten (or erased) after compression.
We introduce a variant called Inference-time Dynamic Prompting (IDP) that can effectively increase prompt diversity without incurring any inference overhead.
arXiv Detail & Related papers (2023-10-02T03:12:06Z) - On the Role of Attention in Prompt-tuning [90.97555030446563]
We study prompt-tuning for one-layer attention architectures through the lens of contextual mixture models.
We show that softmax-prompt-attention is provably more expressive than softmax-self-attention and linear-prompt-attention.
We also provide experiments that verify our theoretical insights on real datasets and demonstrate how prompt-tuning enables the model to attend to context-relevant information.
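The three objects being compared can be written down directly. The sketch below contrasts softmax self-attention with softmax- and linear-prompt-attention in a single-head, one-layer setting; the tied value projection and the unbatched shapes are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def self_attn(X, W):
    """Softmax self-attention: tokens attend only over the input itself.
    X: (seq, dim), W: (dim, dim)."""
    return F.softmax(X @ W @ X.t(), dim=-1) @ X

def prompt_attn(X, P, W, linear=False):
    """Prompt-attention: tokens attend over the concatenation [prompt; input].
    Setting linear=True drops the softmax, giving linear-prompt-attention.
    X: (seq, dim), P: (p, dim), W: (dim, dim)."""
    Z = torch.cat([P, X], dim=0)            # (p + seq, dim)
    scores = X @ W @ Z.t()                  # (seq, p + seq)
    attn = scores if linear else F.softmax(scores, dim=-1)
    return attn @ Z
```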
arXiv Detail & Related papers (2023-06-06T06:23:38Z) - CODA-Prompt: COntinual Decomposed Attention-based Prompting for Rehearsal-Free Continual Learning [30.676509834338884]
Computer vision models suffer from a phenomenon known as catastrophic forgetting when learning novel concepts from continuously shifting training data.
We propose prompting approaches as an alternative to data-rehearsal.
We show that we outperform the current SOTA method DualPrompt on established benchmarks by as much as 4.5% in average final accuracy.
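A rough sketch of the decomposed, attention-based prompting idea: the prompt fed to the frozen backbone is an input-conditioned weighted sum of learnable prompt components. The softmax weighting and shapes below are simplifications; CODA-Prompt's exact attention vector and orthogonality terms are omitted.

```python
import torch
import torch.nn as nn

class DecomposedPrompt(nn.Module):
    """Assemble an instance-specific prompt as a weighted sum of learnable
    prompt components, with weights derived from the query feature."""

    def __init__(self, n_components, prompt_len, dim):
        super().__init__()
        self.components = nn.Parameter(torch.randn(n_components, prompt_len, dim) * 0.02)
        self.keys = nn.Parameter(torch.randn(n_components, dim) * 0.02)

    def forward(self, query):                    # query: (batch, dim), e.g. a [CLS] feature
        # Component weights from similarity between the query and per-component keys
        # (a plain softmax over dot products is used here for brevity).
        weights = torch.softmax(query @ self.keys.t(), dim=-1)   # (batch, n_components)
        # Weighted sum of components -> an input-conditioned prompt.
        return torch.einsum('bn,npd->bpd', weights, self.components)
```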
arXiv Detail & Related papers (2022-11-23T18:57:11Z) - Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models [108.13378788663196]
We propose Subspace Prompt Tuning (SubPT) to project the gradients in back-propagation onto the low-rank subspace spanned by the early-stage gradient flow eigenvectors during the entire training process.
We equip CoOp with Novel Learner Feature (NFL) to enhance the generalization ability of the learned prompts onto novel categories beyond the training set.
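A simplified reading of the gradient projection in SubPT: record gradients early in training, take their top singular directions as the subspace, and project later prompt gradients onto it. The SVD-based construction and the fixed rank are assumptions made for illustration.

```python
import torch

def build_subspace(early_grads, rank=8):
    """Stack gradients recorded early in training, early_grads: (n, dim),
    and keep the top-`rank` right singular vectors as the subspace basis."""
    _, _, vh = torch.linalg.svd(early_grads, full_matrices=False)
    return vh[:rank]                               # (rank, dim)

def project_grad_to_subspace(grad, basis):
    """Project a gradient of shape (..., dim) onto the span of `basis` rows."""
    return (grad @ basis.t()) @ basis
```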
arXiv Detail & Related papers (2022-11-04T02:06:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.