Divide and not forget: Ensemble of selectively trained experts in Continual Learning
- URL: http://arxiv.org/abs/2401.10191v3
- Date: Tue, 19 Mar 2024 14:09:31 GMT
- Title: Divide and not forget: Ensemble of selectively trained experts in Continual Learning
- Authors: Grzegorz Rypeść, Sebastian Cygert, Valeriya Khan, Tomasz Trzciński, Bartosz Zieliński, Bartłomiej Twardowski
- Abstract summary: Class-incremental learning is becoming more popular as it helps models widen their applicability while not forgetting what they already know.
A trend in this area is to use a mixture-of-experts technique, where different models work together to solve the task.
To address this, the authors introduce SEED, which selects a single expert best suited to the task at hand and fine-tunes only that expert on the task's data.
- Score: 0.2886273197127056
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Class-incremental learning is becoming more popular as it helps models widen their applicability while not forgetting what they already know. A trend in this area is to use a mixture-of-expert technique, where different models work together to solve the task. However, the experts are usually trained all at once using whole task data, which makes them all prone to forgetting and increasing computational burden. To address this limitation, we introduce a novel approach named SEED. SEED selects only one, the most optimal expert for a considered task, and uses data from this task to fine-tune only this expert. For this purpose, each expert represents each class with a Gaussian distribution, and the optimal expert is selected based on the similarity of those distributions. Consequently, SEED increases diversity and heterogeneity within the experts while maintaining the high stability of this ensemble method. The extensive experiments demonstrate that SEED achieves state-of-the-art performance in exemplar-free settings across various scenarios, showing the potential of expert diversification through data in continual learning.
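As a rough illustration of the selection step described in the abstract, the sketch below fits one Gaussian per class in each expert's feature space and picks the expert whose class distributions are best separated. It is a minimal sketch under assumptions, not the authors' code: diagonal covariances, a symmetric KL distance, and the helper names (class_gaussians, select_expert) are choices made here, while the exact similarity criterion used by SEED is defined in the paper.

```python
# Minimal sketch of SEED-style expert selection (not the authors' code).
# Assumptions: each expert is a frozen feature extractor; each class is
# summarized by a diagonal-covariance Gaussian in that expert's feature space;
# the task is assigned to the expert whose feature space separates the new
# classes best (largest mean pairwise distance between class Gaussians).
import itertools
import torch

def class_gaussians(expert, loader, device="cpu"):
    """Fit a diagonal Gaussian (mean, variance) per class from expert features."""
    feats, labels = [], []
    expert.eval()
    with torch.no_grad():
        for x, y in loader:
            feats.append(expert(x.to(device)).cpu())
            labels.append(y)
    feats, labels = torch.cat(feats), torch.cat(labels)
    stats = {}
    for c in labels.unique().tolist():
        f = feats[labels == c]
        stats[c] = (f.mean(0), f.var(0, unbiased=False) + 1e-6)
    return stats

def sym_kl(mu1, var1, mu2, var2):
    """Symmetric KL divergence between two diagonal Gaussians."""
    kl12 = 0.5 * (var1 / var2 + (mu2 - mu1) ** 2 / var2 - 1 + (var2 / var1).log()).sum()
    kl21 = 0.5 * (var2 / var1 + (mu1 - mu2) ** 2 / var1 - 1 + (var1 / var2).log()).sum()
    return (kl12 + kl21).item()

def select_expert(experts, task_loader):
    """Return the index of the expert whose class Gaussians are most separated."""
    scores = []
    for expert in experts:
        stats = class_gaussians(expert, task_loader)
        pairs = list(itertools.combinations(stats.values(), 2))
        sep = sum(sym_kl(m1, v1, m2, v2) for (m1, v1), (m2, v2) in pairs) / max(len(pairs), 1)
        scores.append(sep)
    return max(range(len(experts)), key=scores.__getitem__)
```

Only the expert chosen this way is then fine-tuned on the task data, leaving the rest of the ensemble, and hence its earlier knowledge, untouched.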
Related papers
- Mixture of Efficient Diffusion Experts Through Automatic Interval and Sub-Network Selection [63.96018203905272]
We propose to reduce the sampling cost by pruning a pretrained diffusion model into a mixture of efficient experts.
We demonstrate the effectiveness of our method, DiffPruning, across several datasets.
arXiv Detail & Related papers (2024-09-23T21:27:26Z)
- Diversified Batch Selection for Training Acceleration [68.67164304377732]
A prevalent research line, known as online batch selection, explores selecting informative subsets during the training process.
Vanilla reference-model-free methods involve independently scoring and selecting data in a sample-wise manner.
We propose Diversified Batch Selection (DivBS), which is reference-model-free and can efficiently select diverse and representative samples.
arXiv Detail & Related papers (2024-06-07T12:12:20Z)
- Learning More Generalized Experts by Merging Experts in Mixture-of-Experts [0.5221459608786241]
We show that incorporating a shared layer in a mixture-of-experts can lead to performance degradation.
We merge the two most frequently selected experts and update the least frequently selected expert using the combination of experts, as sketched after this entry.
Our algorithm enhances transfer learning and mitigates catastrophic forgetting when applied to multi-domain task incremental learning.
arXiv Detail & Related papers (2024-05-19T11:55:48Z)
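The merging rule quoted above can be pictured with a short sketch. This is an illustrative interpretation, not the paper's implementation: selection counts are assumed to come from the router, experts are assumed to share one architecture, and the "combination" is approximated by a plain parameter average.

```python
# Illustrative sketch of the merging rule described above (not the paper's code).
# Assumption: the "combination" of experts is a plain average of their parameters.
import copy
import torch

@torch.no_grad()
def merge_experts(experts, selection_counts):
    """Average the two most-selected experts and write the result into the least-selected one."""
    order = sorted(range(len(experts)), key=lambda i: selection_counts[i])
    least, top1, top2 = order[0], order[-1], order[-2]
    merged = copy.deepcopy(experts[top1])
    for p_m, p_a, p_b in zip(merged.parameters(),
                             experts[top1].parameters(),
                             experts[top2].parameters()):
        p_m.copy_(0.5 * (p_a + p_b))  # element-wise average of the two most-used experts
    experts[least].load_state_dict(merged.state_dict())
    return experts
```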
- Learning to Route Among Specialized Experts for Zero-Shot Generalization [39.56470042680907]
We propose Post-Hoc Adaptive Tokenwise Gating Over an Ocean of Specialized Experts (PHATGOOSE).
It learns to route among specialized modules that were produced through parameter-efficient fine-tuning.
It does not require simultaneous access to the datasets used to create the specialized models and only requires a modest amount of additional compute after each expert model is trained.
arXiv Detail & Related papers (2024-02-08T17:43:22Z)
- Inverse Reinforcement Learning with Sub-optimal Experts [56.553106680769474]
We study the theoretical properties of the class of reward functions that are compatible with a given set of experts.
Our results show that the presence of multiple sub-optimal experts can significantly shrink the set of compatible rewards.
We analyze a uniform sampling algorithm that is minimax optimal whenever the sub-optimal experts' performance is sufficiently close to that of the optimal agent.
arXiv Detail & Related papers (2024-01-08T12:39:25Z)
- Active Ranking of Experts Based on their Performances in Many Tasks [72.96112117037465]
We consider the problem of ranking n experts based on their performances on d tasks.
We make a monotonicity assumption stating that for each pair of experts, one outperforms the other on all tasks.
arXiv Detail & Related papers (2023-06-05T06:55:39Z)
- Diversified Dynamic Routing for Vision Tasks [36.199659460868496]
We propose a novel architecture where each layer is composed of a set of experts.
In our method, the model is explicitly trained to solve the challenge of finding a relevant partitioning of the data.
We conduct several experiments on semantic segmentation on Cityscapes and object detection and instance segmentation on MS-COCO.
arXiv Detail & Related papers (2022-09-26T23:27:51Z)
- Balanced Product of Calibrated Experts for Long-Tailed Recognition [13.194151879344487]
Many real-world recognition problems are characterized by long-tailed label distributions.
In this work, we take an analytical approach and extend the notion of logit adjustment to ensembles to form a Balanced Product of Experts (BalPoE).
We show how to properly define the experts' target distributions and combine the experts in order to achieve unbiased predictions, as sketched after this entry.
arXiv Detail & Related papers (2022-06-10T17:59:02Z)
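Logit adjustment shifts a classifier's logits by the log of the class prior so that predictions stop favoring head classes. The snippet below is only a rough illustration of carrying that idea over to an ensemble, shifting each expert's logits by the training prior and averaging; the calibrated product-of-experts formulation and its unbiasedness guarantees are derived in the paper itself.

```python
# Rough illustration of logit adjustment applied to an ensemble of experts.
# This is a simplified stand-in for BalPoE, not the paper's calibrated method.
import torch

def balanced_ensemble_logits(expert_logits, class_counts, tau=1.0):
    """expert_logits: list of (batch, num_classes) tensors; class_counts: (num_classes,) tensor."""
    log_prior = torch.log(class_counts.float() / class_counts.sum())
    adjusted = [logits - tau * log_prior for logits in expert_logits]  # remove training-prior bias
    return torch.stack(adjusted).mean(0)  # average the debiased experts in logit space
```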
- Test-Agnostic Long-Tailed Recognition by Test-Time Aggregating Diverse Experts with Self-Supervision [85.07855130048951]
We study a more practical task setting, called test-agnostic long-tailed recognition, where the training class distribution is long-tailed while the test class distribution is unknown.
We propose a new method, called Test-time Aggregating Diverse Experts (TADE), that trains diverse experts to excel at handling different test distributions.
We theoretically show that our method has provable ability to simulate unknown test class distributions.
arXiv Detail & Related papers (2021-07-20T04:10:31Z)
- Learning From Multiple Experts: Self-paced Knowledge Distillation for Long-tailed Classification [106.08067870620218]
We propose a self-paced knowledge distillation framework, termed Learning From Multiple Experts (LFME).
We refer to models trained on less imbalanced subsets of the data as 'Experts', and the proposed LFME framework aggregates the knowledge from multiple 'Experts' to learn a unified student model, as sketched after this entry.
We conduct extensive experiments and demonstrate that our method is able to achieve superior performances compared to state-of-the-art methods.
arXiv Detail & Related papers (2020-01-06T12:57:36Z)
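As a generic picture of the entry above, distilling several experts into one student can be written as a multi-teacher distillation loss; the sketch below omits the self-paced scheduling that LFME actually uses, and the simple average of teacher probabilities is an assumption, not the paper's formulation.

```python
# Generic multi-teacher distillation sketch (self-paced scheduling omitted; not LFME's code).
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits_list, targets, T=2.0, alpha=0.5):
    """Cross-entropy on labels plus KL to the averaged, temperature-softened teacher distribution."""
    with torch.no_grad():
        teacher_probs = torch.stack(
            [F.softmax(t / T, dim=-1) for t in teacher_logits_list]
        ).mean(0)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  teacher_probs, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kd + (1 - alpha) * ce
```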
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.