More Experts Than Galaxies: Conditionally-overlapping Experts With Biologically-Inspired Fixed Routing
- URL: http://arxiv.org/abs/2410.08003v2
- Date: Fri, 18 Oct 2024 18:26:38 GMT
- Title: More Experts Than Galaxies: Conditionally-overlapping Experts With Biologically-Inspired Fixed Routing
- Authors: Sagi Shaier, Francisco Pereira, Katharina von der Wense, Lawrence E Hunter, Matt Jones
- Abstract summary: Conditionally Overlapping Mixture of ExperTs (COMET) is a general deep learning method that induces a modular, sparse architecture with an exponential number of overlapping experts.
We demonstrate the effectiveness of COMET on a range of tasks, including image classification, language modeling, and regression.
- Score: 5.846028298833611
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The evolution of biological neural systems has led to both modularity and sparse coding, which enable energy efficiency and robustness across the diversity of tasks in the lifespan. In contrast, standard neural networks rely on dense, non-specialized architectures, where all model parameters are simultaneously updated to learn multiple tasks, leading to representation interference. Current sparse neural network approaches aim to alleviate this issue, but are often hindered by limitations such as 1) trainable gating functions that cause representation collapse; 2) non-overlapping experts that result in redundant computation and slow learning; and 3) reliance on explicit input or task IDs that impose significant constraints on flexibility and scalability. In this paper we propose Conditionally Overlapping Mixture of ExperTs (COMET), a general deep learning method that addresses these challenges by inducing a modular, sparse architecture with an exponential number of overlapping experts. COMET replaces the trainable gating function used in Sparse Mixture of Experts with a fixed, biologically inspired random projection applied to individual input representations. This design causes the degree of expert overlap to depend on input similarity, so that similar inputs tend to share more parameters. This facilitates positive knowledge transfer, resulting in faster learning and improved generalization. We demonstrate the effectiveness of COMET on a range of tasks, including image classification, language modeling, and regression, using several popular deep learning architectures.
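As a rough illustration of the mechanism the abstract describes, here is a minimal PyTorch sketch of fixed routing: a frozen random projection scores each hidden unit per input, and a binary mask selects which units fire, so each distinct mask acts as one "expert". The class names, the top-k thresholding, and the Gaussian initialization are illustrative assumptions, not the paper's exact construction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FixedRoutingMask(nn.Module):
    """Gate hidden units with a FIXED (non-trainable) random projection.

    Sketch of the COMET idea from the abstract: a frozen random
    projection scores each hidden unit for a given input, and the
    top-k units are kept. Each distinct binary mask is one "expert",
    so a hidden layer admits exponentially many overlapping experts.
    NOTE: top-k selection and Gaussian init are assumptions made for
    illustration only.
    """

    def __init__(self, in_dim: int, hidden_dim: int, k: int):
        super().__init__()
        self.k = k
        # Fixed random projection: input -> per-unit scores (never trained).
        proj = torch.randn(in_dim, hidden_dim) / in_dim ** 0.5
        self.register_buffer("proj", proj)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = x @ self.proj                      # (batch, hidden_dim)
        topk = scores.topk(self.k, dim=-1).indices  # k most active units
        mask = torch.zeros_like(scores)
        return mask.scatter(-1, topk, 1.0)          # binary mask per input

class CometMLPBlock(nn.Module):
    """One MLP layer whose hidden units are gated by the fixed router."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int, k: int):
        super().__init__()
        self.router = FixedRoutingMask(in_dim, hidden_dim, k)
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = self.router(x)            # which unit subset (expert) fires
        h = F.relu(self.fc1(x)) * mask   # only selected units contribute
        return self.fc2(h)

# Usage sketch: similar inputs project to similar scores, so they
# receive largely overlapping masks and share parameters.
block = CometMLPBlock(in_dim=16, hidden_dim=64, out_dim=10, k=16)
x = torch.randn(2, 16)
print(block(x).shape)  # torch.Size([2, 10])
```

Because the projection is frozen, nearby inputs get similar scores and hence overlapping masks, which is the input-similarity-dependent expert overlap the abstract attributes to COMET.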
Related papers
- Flexible task abstractions emerge in linear networks with fast and bounded units [47.11054206483159]
We analyze a linear gated network where the weights and gates are jointly optimized via gradient descent.
We show that the discovered task abstractions support generalization through both task and subtask composition.
Our work offers a theory of cognitive flexibility in animals as arising from joint gradient descent on synaptic and neural gating.
arXiv Detail & Related papers (2024-11-06T11:24:02Z)
- Equivariance with Learned Canonicalization Functions [77.32483958400282]
We show that learning a small neural network to perform canonicalization is better than using predefined heuristics.
Our experiments show that learning the canonicalization function is competitive with existing techniques for learning equivariant functions across many tasks.
arXiv Detail & Related papers (2022-11-11T21:58:15Z)
- Improving the Robustness of Neural Multiplication Units with Reversible Stochasticity [2.4278445972594525]
Multilayer Perceptrons struggle to learn certain simple arithmetic tasks.
The specialist neural multiplication unit (sNMU) is proposed to apply reversible stochasticity, encouraging the avoidance of such optima.
arXiv Detail & Related papers (2022-11-10T14:56:37Z)
- Dynamic Inference with Neural Interpreters [72.90231306252007]
We present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules.
Inputs to the model are routed through a sequence of functions in a way that is learned end-to-end.
We show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferable to a new task in a sample-efficient manner.
arXiv Detail & Related papers (2021-10-12T23:22:45Z)
- Spatio-Temporal Representation Factorization for Video-based Person Re-Identification [55.01276167336187]
We propose a Spatio-Temporal Representation Factorization (STRF) module for re-ID.
STRF is a flexible new computational unit that can be used in conjunction with most existing 3D convolutional neural network architectures for re-ID.
We empirically show that STRF improves performance of various existing baseline architectures while demonstrating new state-of-the-art results.
arXiv Detail & Related papers (2021-07-25T19:29:37Z)
- Recognizing and Verifying Mathematical Equations using Multiplicative Differential Neural Units [86.9207811656179]
We show that memory-augmented neural networks (NNs) can achieve higher-order extrapolation, stable performance, and faster convergence.
Our models achieve a 1.53% average improvement over current state-of-the-art methods in equation verification and achieve a 2.22% Top-1 average accuracy and 2.96% Top-5 average accuracy for equation completion.
arXiv Detail & Related papers (2021-04-07T03:50:11Z)
- Multi-task Supervised Learning via Cross-learning [102.64082402388192]
We consider a problem known as multi-task learning, consisting of fitting a set of regression functions intended for solving different tasks.
In our novel formulation, we couple the parameters of these functions, so that they learn in their task specific domains while staying close to each other.
This facilitates cross-fertilization, in which data collected across different domains help improve learning performance on the other tasks.
arXiv Detail & Related papers (2020-10-24T21:35:57Z)
- Beneficial Perturbation Network for designing general adaptive artificial intelligence systems [14.226973149346886]
We propose a new type of deep neural network with extra, out-of-network, task-dependent biasing units to accommodate dynamic situations.
Our approach is memory-efficient and parameter-efficient, can accommodate many tasks, and achieves state-of-the-art performance across different tasks and domains.
arXiv Detail & Related papers (2020-09-27T01:28:10Z)
- Reparameterizing Convolutions for Incremental Multi-Task Learning without Task Interference [75.95287293847697]
Two common challenges in developing multi-task models are often overlooked in the literature.
First, enabling the model to be inherently incremental, continuously incorporating information from new tasks without forgetting previously learned ones (incremental learning).
Second, eliminating adverse interactions among tasks, which have been shown to significantly degrade single-task performance in a multi-task setup (task interference).
arXiv Detail & Related papers (2020-07-24T14:44:46Z)