Routing Networks with Co-training for Continual Learning
- URL: http://arxiv.org/abs/2009.04381v1
- Date: Wed, 9 Sep 2020 15:58:51 GMT
- Title: Routing Networks with Co-training for Continual Learning
- Authors: Mark Collier, Efi Kokiopoulou, Andrea Gesmundo, Jesse Berent
- Abstract summary: We propose the use of sparse routing networks for continual learning.
For each input, these network architectures activate a different path through a network of experts.
In practice, we find it is necessary to develop a new training method for routing networks, which we call co-training.
- Score: 5.957609459173546
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The core challenge with continual learning is catastrophic forgetting, the
phenomenon that when neural networks are trained on a sequence of tasks they
rapidly forget previously learned tasks. It has been observed that catastrophic
forgetting is most severe when tasks are dissimilar to each other. We propose
the use of sparse routing networks for continual learning. For each input,
these network architectures activate a different path through a network of
experts. Routing networks have been shown to learn to route similar tasks to
overlapping sets of experts and dissimilar tasks to disjoint sets of experts.
In the continual learning context this behaviour is desirable as it minimizes
interference between dissimilar tasks while allowing positive transfer between
related tasks. In practice, we find it is necessary to develop a new training
method for routing networks, which we call co-training, which avoids poorly
initialized experts when new tasks are presented. When combined with a small
episodic memory replay buffer, sparse routing networks with co-training
outperform densely connected networks on the MNIST-Permutations and
MNIST-Rotations benchmarks.
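A rough PyTorch sketch of the idea is shown below (not the authors' implementation; all names, layer sizes, and the reservoir-sampling buffer are assumptions): a router scores a pool of experts, the top-k routed outputs are mixed per input (evaluated densely here for simplicity), and a small episodic memory buffer stores past examples for rehearsal. The co-training procedure itself is not detailed in the abstract and is therefore omitted.
```python
# Minimal, illustrative sketch of a sparsely routed expert network plus a small
# episodic replay buffer. Names and hyperparameters are placeholders.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseRoutingNet(nn.Module):
    """Router picks k experts per input; only their outputs are mixed."""

    def __init__(self, in_dim, hidden_dim, out_dim, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
            for _ in range(num_experts)
        )
        self.router = nn.Linear(in_dim, num_experts)   # one score per expert
        self.head = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        scores = self.router(x)                        # (batch, num_experts)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)         # renormalise over chosen experts
        # Evaluate all experts densely for simplicity, then keep the routed ones.
        stacked = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, hidden)
        rows = torch.arange(x.size(0), device=x.device).unsqueeze(1)
        chosen = stacked[rows, topk_idx]               # (batch, k, hidden)
        mixed = (weights.unsqueeze(-1) * chosen).sum(dim=1)
        return self.head(mixed)


class ReplayBuffer:
    """Tiny reservoir-sampled episodic memory for rehearsal."""

    def __init__(self, capacity=500):
        self.capacity, self.data, self.seen = capacity, [], 0

    def add(self, x, y):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append((x, y))
        else:
            j = random.randrange(self.seen)
            if j < self.capacity:
                self.data[j] = (x, y)

    def sample(self, n):
        batch = random.sample(self.data, min(n, len(self.data)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.stack(ys)


net = SparseRoutingNet(in_dim=784, hidden_dim=256, out_dim=10)
buffer = ReplayBuffer(capacity=500)
```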
Related papers
- Stitching for Neuroevolution: Recombining Deep Neural Networks without Breaking Them [0.0]
Traditional approaches to neuroevolution often start from scratch.
Recombining trained networks is non-trivial because architectures and feature representations typically differ.
We employ stitching, which merges the networks by introducing new layers at crossover points.
arXiv Detail & Related papers (2024-03-21T08:30:44Z) - Negotiated Representations to Prevent Forgetting in Machine Learning Applications [0.0]
Catastrophic forgetting is a significant challenge in the field of machine learning.
We propose a novel method for preventing catastrophic forgetting in machine learning applications.
arXiv Detail & Related papers (2023-11-30T22:43:50Z) - Provable Multi-Task Representation Learning by Two-Layer ReLU Neural Networks [69.38572074372392]
We present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks.
Our key insight is that multi-task pretraining induces a pseudo-contrastive loss that favors representations that align points that typically have the same label across tasks.
arXiv Detail & Related papers (2023-07-13T16:39:08Z) - Modular Approach to Machine Reading Comprehension: Mixture of Task-Aware Experts [0.5801044612920815]
We present a Mixture of Task-Aware Experts Network for Machine Reading on a relatively small dataset.
We focus on the issue of common-sense learning, enforcing common-ground knowledge.
We take inspiration from recent advances in multitask and transfer learning.
arXiv Detail & Related papers (2022-10-04T17:13:41Z) - Sparsely Activated Mixture-of-Experts are Robust Multi-Task Learners [67.5865966762559]
We study whether sparsely activated Mixture-of-Experts (MoE) improve multi-task learning.
We devise task-aware gating functions to route examples from different tasks to specialized experts.
This results in a sparsely activated multi-task model with a large number of parameters, but with the same computational cost as that of a dense model.
arXiv Detail & Related papers (2022-04-16T00:56:12Z) - Thinking Deeply with Recurrence: Generalizing from Easy to Hard Sequential Reasoning Problems [51.132938969015825]
We observe that recurrent networks have the uncanny ability to closely emulate the behavior of non-recurrent deep models.
We show that recurrent networks trained to solve simple mazes with few recurrent steps can solve much more complex problems simply by performing additional recurrences during inference (a minimal sketch appears after this list).
arXiv Detail & Related papers (2021-02-22T14:09:20Z) - Beneficial Perturbation Network for designing general adaptive artificial intelligence systems [14.226973149346886]
We propose a new type of deep neural network with extra, out-of-network, task-dependent biasing units to accommodate dynamic situations.
Our approach is memory-efficient and parameter-efficient, can accommodate many tasks, and achieves state-of-the-art performance across different tasks and domains.
arXiv Detail & Related papers (2020-09-27T01:28:10Z) - Auxiliary Learning by Implicit Differentiation [54.92146615836611]
Training neural networks with auxiliary tasks is a common practice for improving the performance on a main task of interest.
Here, we propose a novel framework, AuxiLearn, that targets both challenges based on implicit differentiation.
First, when useful auxiliaries are known, we propose learning a network that combines all losses into a single coherent objective function.
Second, when no useful auxiliary task is known, we describe how to learn a network that generates a meaningful, novel auxiliary task.
arXiv Detail & Related papers (2020-06-22T19:35:07Z) - Learning to Branch for Multi-Task Learning [12.49373126819798]
We present an automated multi-task learning algorithm that learns where to share or branch within a network.
We propose a novel tree-structured design space that casts a tree branching operation as a Gumbel-Softmax sampling procedure (a toy sketch appears after this list).
arXiv Detail & Related papers (2020-06-02T19:23:21Z) - Semantic Drift Compensation for Class-Incremental Learning [48.749630494026086]
Class-incremental learning of deep networks sequentially increases the number of classes to be classified.
We propose a new method to estimate the drift, called semantic drift, of features and compensate for it without the need of any exemplars.
arXiv Detail & Related papers (2020-04-01T13:31:19Z) - Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks [95.51368472949308]
Adaptation can be useful in cases when training data is scarce, or when one wishes to encode priors in the network.
In this paper, we propose a straightforward alternative: side-tuning (a minimal sketch appears after this list).
arXiv Detail & Related papers (2019-12-31T18:52:32Z)
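For the "Thinking Deeply with Recurrence" entry above, a minimal sketch (hypothetical names, not the authors' code) of a weight-tied recurrent block whose number of iterations can be increased at inference time:
```python
# Sketch of a weight-tied recurrent block iterated more times at test time
# than during training; illustrative only.
import torch
import torch.nn as nn


class RecurrentSolver(nn.Module):
    """One shared block applied repeatedly; depth is an inference-time knob."""

    def __init__(self, dim, train_iters=5):
        super().__init__()
        self.train_iters = train_iters
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())  # shared weights
        self.readout = nn.Linear(dim, dim)

    def forward(self, x, iters=None):
        iters = self.train_iters if iters is None else iters
        h = x
        for _ in range(iters):       # same block, applied `iters` times
            h = self.block(h) + x    # residual connection keeps the input in view
        return self.readout(h)


# Train with a small iteration budget on easy instances...
model = RecurrentSolver(dim=64)
y_easy = model(torch.randn(8, 64))
# ...then simply run more recurrences on harder instances at inference time.
y_hard = model(torch.randn(8, 64), iters=30)
```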
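For the "Learning to Branch for Multi-Task Learning" entry, a toy illustration (assumed names, drastically simplified relative to the paper's tree-structured search space) of making a branching choice differentiable with Gumbel-Softmax:
```python
# Sketch of a differentiable branch selection via Gumbel-Softmax; a toy
# stand-in for the tree-structured design space described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftBranch(nn.Module):
    """Differentiable choice among candidate branches via Gumbel-Softmax."""

    def __init__(self, dim, num_branches=3):
        super().__init__()
        self.branches = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_branches))
        # Learnable logits over which branch this node routes through.
        self.logits = nn.Parameter(torch.zeros(num_branches))

    def forward(self, x, tau=1.0):
        # Hard one-hot sample in the forward pass, soft gradients to the logits.
        gate = F.gumbel_softmax(self.logits, tau=tau, hard=True)
        outs = torch.stack([b(x) for b in self.branches], dim=0)  # (branches, batch, dim)
        return torch.einsum('b,bnd->nd', gate, outs)


branch = SoftBranch(dim=64)
y = branch(torch.randn(16, 64), tau=0.5)
```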
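For the "Side-Tuning" entry, a minimal sketch of the additive adaptation idea: a frozen base network plus a small trainable side network, blended by a learned scalar. Names and the sigmoid-parameterised blend are assumptions, not the paper's exact recipe.
```python
# Sketch of side-tuning style additive adaptation; illustrative only.
import torch
import torch.nn as nn


class SideTuned(nn.Module):
    """Frozen base network plus a small trainable side network, blended additively."""

    def __init__(self, base: nn.Module, side: nn.Module):
        super().__init__()
        self.base = base
        for p in self.base.parameters():              # the base stays fixed
            p.requires_grad_(False)
        self.side = side                              # lightweight, task-specific
        self.alpha = nn.Parameter(torch.tensor(0.0))  # learned blending weight

    def forward(self, x):
        a = torch.sigmoid(self.alpha)                 # keep the blend in [0, 1]
        return a * self.base(x) + (1 - a) * self.side(x)


base = nn.Linear(32, 10)   # stand-in for a pretrained network
side = nn.Linear(32, 10)   # small side network trained for the new task
model = SideTuned(base, side)
out = model(torch.randn(4, 32))
```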