Hash Layers For Large Sparse Models
- URL: http://arxiv.org/abs/2106.04426v1
- Date: Tue, 8 Jun 2021 14:54:24 GMT
- Title: Hash Layers For Large Sparse Models
- Authors: Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston
- Abstract summary: We modify the feedforward layer to hash to different sets of weights depending on the current token, over all tokens in the sequence.
We show that this procedure either outperforms or is competitive with learning-to-route mixture-of-expert methods.
- Score: 48.90784451703753
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the training of sparse layers that use different parameters
for different inputs based on hashing in large Transformer models.
Specifically, we modify the feedforward layer to hash to different sets of
weights depending on the current token, over all tokens in the sequence. We
show that this procedure either outperforms or is competitive with
learning-to-route mixture-of-expert methods such as Switch Transformers and
BASE Layers, while requiring no routing parameters or extra terms in the
objective function such as a load balancing loss, and no sophisticated
assignment algorithm. We study the performance of different hashing techniques,
hash sizes and input features, and show that balanced and random hashes focused
on the most local features work best, compared to either learning clusters or
using longer-range context. We show our approach works well both on large
language modeling and dialogue tasks, and on downstream fine-tuning tasks.
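As a concrete illustration of the idea in the abstract, below is a minimal PyTorch sketch of a hash-routed feedforward layer: each token id is mapped to one expert FFN by a fixed, balanced random hash, so there are no routing parameters, no load-balancing loss, and no learned assignment algorithm. The class name HashFFN, the layer sizes, and the randperm-based hash are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class HashFFN(nn.Module):
    """Feedforward layer whose weights are chosen by a fixed hash of the token id (sketch)."""

    def __init__(self, vocab_size: int, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        # Fixed, non-learned balanced random hash: token ids are spread evenly
        # across experts, and the assignment never changes during training.
        self.register_buffer("token_to_expert", torch.randperm(vocab_size) % num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); token_ids: (batch, seq)
        expert_ids = self.token_to_expert[token_ids]
        out = torch.zeros_like(hidden)
        for e, expert in enumerate(self.experts):
            mask = expert_ids == e  # tokens hashed to expert e
            if mask.any():
                out[mask] = expert(hidden[mask])
        return out


# Hypothetical usage: route a batch of token representations through the hashed experts.
layer = HashFFN(vocab_size=50_000, d_model=512, d_ff=2048, num_experts=16)
x = torch.randn(2, 10, 512)
ids = torch.randint(0, 50_000, (2, 10))
y = layer(x, ids)  # (2, 10, 512)
```

Because the hash is fixed before training, the objective function needs no extra terms. Note that this sketch balances over token ids rather than token frequencies; the paper studies several hashing choices, including balanced and random hashes over local input features.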
Related papers
- Merging Multi-Task Models via Weight-Ensembling Mixture of Experts [64.94129594112557]
Merging Transformer-based models trained on different tasks into a single unified model allows all the tasks to be executed concurrently.
Previous methods, exemplified by task arithmetic, have proven to be both effective and scalable.
We propose to merge most of the parameters while upscaling the Transformer layers to a weight-ensembling mixture of experts (MoE) module.
arXiv Detail & Related papers (2024-02-01T08:58:57Z)
- Scalarization for Multi-Task and Multi-Domain Learning at Scale [15.545810422759295]
Training a single model on multiple input domains and/or output tasks allows for compressing information from multiple sources into a unified backbone.
However, optimizing such networks is a challenge due to discrepancies between the different tasks or domains.
arXiv Detail & Related papers (2023-10-13T07:31:04Z)
- MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers [140.0479479231558]
In this work, we aim to unify a variety of pre-training tasks into a multi-task pre-trained model, namely MASTER.
MASTER utilizes a shared-encoder multi-decoder architecture that can construct a representation bottleneck to compress the abundant semantic information across tasks into dense vectors.
arXiv Detail & Related papers (2022-12-15T13:57:07Z)
- Binary Representation via Jointly Personalized Sparse Hashing [22.296464665032588]
We propose an effective unsupervised method, namely Jointly Personalized Sparse Hashing (JPSH), for binary representation learning.
Different personalized subspaces are constructed to reflect category-specific attributes for different clusters.
To simultaneously preserve semantic and pairwise similarities, JPSH incorporates PSH and manifold-based hash learning into a seamless formulation.
arXiv Detail & Related papers (2022-08-31T14:18:37Z)
- DiSparse: Disentangled Sparsification for Multitask Model Compression [92.84435347164435]
DiSparse is a simple, effective, and first-of-its-kind multitask pruning and sparse training scheme.
Our experimental results demonstrate superior performance on various configurations and settings.
arXiv Detail & Related papers (2022-06-09T17:57:46Z)
- Training Neural Networks with Fixed Sparse Masks [19.58969772430058]
Recent work has shown that it is possible to update only a small subset of the model's parameters during training.
We show that it is possible to induce a fixed sparse mask on the model's parameters that selects a subset to update over many iterations.
arXiv Detail & Related papers (2021-11-18T18:06:01Z)
- Few-shot Sequence Learning with Transformers [79.87875859408955]
Few-shot algorithms aim at learning new tasks provided only a handful of training examples.
In this work we investigate few-shot learning in the setting where the data points are sequences of tokens.
We propose an efficient learning algorithm based on Transformers.
arXiv Detail & Related papers (2020-12-17T12:30:38Z)
- Multi-scale Interactive Network for Salient Object Detection [91.43066633305662]
We propose aggregate interaction modules to integrate features from adjacent levels.
To obtain more efficient multi-scale features, self-interaction modules are embedded in each decoder unit.
Experimental results on five benchmark datasets demonstrate that the proposed method without any post-processing performs favorably against 23 state-of-the-art approaches.
arXiv Detail & Related papers (2020-07-17T15:41:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.