HashFormers: Towards Vocabulary-independent Pre-trained Transformers
- URL: http://arxiv.org/abs/2210.07904v1
- Date: Fri, 14 Oct 2022 15:39:34 GMT
- Title: HashFormers: Towards Vocabulary-independent Pre-trained Transformers
- Authors: Huiyin Xue and Nikolaos Aletras
- Abstract summary: Transformer-based pre-trained language models are vocabulary-dependent, mapping each token by default to its corresponding embedding.
We propose HashFormers, a new family of vocabulary-independent pre-trained transformers.
- Score: 30.699644290131044
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based pre-trained language models are vocabulary-dependent,
mapping by default each token to its corresponding embedding. This one-to-one
mapping results in embedding matrices that occupy a lot of memory (i.e.
millions of parameters) and grow linearly with the size of the vocabulary.
Previous work on on-device transformers dynamically generates token embeddings
on-the-fly without embedding matrices using locality-sensitive hashing over
morphological information. These embeddings are subsequently fed into
transformer layers for text classification. However, these methods are not
pre-trained. Inspired by this line of work, we propose HashFormers, a new
family of vocabulary-independent pre-trained transformers that support an
unlimited vocabulary (i.e. all possible tokens in a corpus) given a
substantially smaller fixed-sized embedding matrix. We achieve this by first
introducing computationally cheap hashing functions that bucket together
individual tokens to embeddings. We also propose three variants that do not
require an embedding matrix at all, further reducing the memory requirements.
We empirically demonstrate that HashFormers are more memory efficient compared
to standard pre-trained transformers while achieving comparable predictive
performance when fine-tuned on multiple text classification tasks. For example,
our most efficient HashFormer variant has a negligible performance degradation
(0.4% on GLUE) using only 99.1K parameters for representing the embeddings
compared to 12.3-38M parameters of state-of-the-art models.
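The snippet below is a minimal sketch of the bucketed embedding lookup described in the abstract: an arbitrarily large vocabulary is hashed into a small, fixed number of embedding rows, so memory no longer grows with vocabulary size. The bucket count, hash function, and class name are illustrative assumptions, not the paper's exact construction.

```python
import hashlib

import torch
import torch.nn as nn


class HashedEmbedding(nn.Module):
    """Bucket an unbounded vocabulary into a small, fixed-size embedding matrix
    via a cheap hash function (sizes and hash choice are illustrative)."""

    def __init__(self, num_buckets: int = 768, embedding_dim: int = 128):
        super().__init__()
        self.num_buckets = num_buckets
        # Memory is O(num_buckets * embedding_dim), independent of vocabulary size.
        self.table = nn.Embedding(num_buckets, embedding_dim)

    def bucket(self, token: str) -> int:
        # Deterministic, computationally cheap hash of the raw token string.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little") % self.num_buckets

    def forward(self, tokens: list[str]) -> torch.Tensor:
        ids = torch.tensor([self.bucket(t) for t in tokens], dtype=torch.long)
        return self.table(ids)  # shape: (len(tokens), embedding_dim)


emb = HashedEmbedding()
print(emb(["hash", "formers", "handle", "any", "unseen", "token"]).shape)  # torch.Size([6, 128])
```

Tokens that collide in the hash share an embedding row; accepting such collisions is exactly what lets the embedding matrix stay small and vocabulary-independent.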
Related papers
- Memory-efficient Stochastic methods for Memory-based Transformers [3.360916255196531]
Memory-based transformers can require a large amount of memory and can be quite inefficient.
We propose a novel two-phase training mechanism and a novel regularization technique to improve the training efficiency of memory-based transformers.
arXiv Detail & Related papers (2023-11-14T12:37:25Z)
- Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning [42.652021176354644]
The memory footprint of pre-trained language models (PLMs) can hinder deployment in memory-constrained settings.
We propose a simple yet effective approach that exploits the observation that a large proportion of the vocabulary goes unused in downstream settings, pruning the corresponding rows to minimize the memory footprint of the embedding matrix.
We show that this approach provides substantial reductions in memory usage across a wide range of models and tasks.
arXiv Detail & Related papers (2023-09-15T19:00:00Z)
- Linearizing Transformer with Key-Value Memory Bank [54.83663647680612]
We propose MemSizer, an approach that projects the source sequence into a lower-dimensional representation.
MemSizer not only achieves linear time complexity but also enjoys efficient recurrent-style autoregressive generation.
We demonstrate that MemSizer provides an improved tradeoff between efficiency and accuracy over the vanilla transformer.
arXiv Detail & Related papers (2022-03-23T18:10:18Z)
- Improving language models by retrieving from trillions of tokens [50.42630445476544]
We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus.
With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile.
arXiv Detail & Related papers (2021-12-08T17:32:34Z)
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
- Hash Layers For Large Sparse Models [48.90784451703753]
We modify the feedforward layer to hash to different sets of weights depending on the current token, over all tokens in the sequence.
We show that this procedure either outperforms or is competitive with learning-to-route mixture-of-expert methods.
arXiv Detail & Related papers (2021-06-08T14:54:24Z)
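Below is a rough sketch of the hash-routing idea summarized in the Hash Layers entry above: each position is sent to one of several feedforward weight sets chosen by a deterministic hash of its token, with no learned router. The modulo hash, layer sizes, and class name are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn


class HashRoutedFFN(nn.Module):
    """Each position is processed by one of several feedforward weight sets,
    chosen by a fixed hash of the token id rather than a learned router."""

    def __init__(self, d_model: int = 64, d_hidden: int = 256, num_experts: int = 4):
        super().__init__()
        self.num_experts = num_experts
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (seq_len, d_model); token_ids: (seq_len,)
        expert_index = token_ids % self.num_experts  # deterministic "hash" routing
        out = torch.empty_like(hidden)
        for e in range(self.num_experts):
            mask = expert_index == e
            if mask.any():
                out[mask] = self.experts[e](hidden[mask])
        return out


layer = HashRoutedFFN()
hidden = torch.randn(10, 64)
token_ids = torch.randint(0, 30522, (10,))
print(layer(hidden, token_ids).shape)  # torch.Size([10, 64])
```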
- SparseGAN: Sparse Generative Adversarial Network for Text Generation [8.634962333084724]
We propose SparseGAN, which generates semantically interpretable but sparse sentence representations as inputs to the discriminator.
With such semantically rich representations, we not only reduce unnecessary noise for efficient adversarial training, but also make the entire training process fully differentiable.
arXiv Detail & Related papers (2021-03-22T04:44:43Z)
- Shortformer: Better Language Modeling using Shorter Inputs [62.51758040848735]
We show that initially training the model on short subsequences, before moving on to longer ones, both reduces overall training time and improves perplexity.
We then show how to improve the efficiency of recurrence methods in transformers.
arXiv Detail & Related papers (2020-12-31T18:52:59Z)
- All Word Embeddings from One Embedding [23.643059189673473]
In neural network-based models for natural language processing, the largest part of the parameters often consists of word embeddings.
In this study, to reduce the total number of parameters, the embeddings for all words are represented by transforming a shared embedding.
The proposed method, ALONE, constructs the embedding of a word by modifying the shared embedding with a filter vector, which is word-specific but non-trainable.
arXiv Detail & Related papers (2020-04-25T07:38:08Z)
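The following is a minimal sketch of the ALONE-style construction summarized above: a single shared trainable embedding is modulated by a word-specific, non-trainable filter vector and refined by a small feedforward network. The checksum-seeded filter and all sizes are illustrative stand-ins for the paper's codebook-based filters.

```python
import zlib

import torch
import torch.nn as nn


class AloneStyleEmbedding(nn.Module):
    """Every word's embedding is derived from ONE shared trainable vector,
    modulated by a word-specific, non-trainable filter and a small refinement net."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.shared = nn.Parameter(torch.randn(dim) * 0.02)  # the single shared embedding
        self.refine = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def filter_vector(self, word: str) -> torch.Tensor:
        # Word-specific but non-trainable: a fixed random vector seeded by a
        # checksum of the word (stand-in for ALONE's codebook construction).
        gen = torch.Generator().manual_seed(zlib.crc32(word.encode("utf-8")))
        return torch.randn(self.shared.shape[0], generator=gen)

    def forward(self, words: list[str]) -> torch.Tensor:
        filters = torch.stack([self.filter_vector(w) for w in words])
        return self.refine(self.shared * filters)  # shape: (len(words), dim)


emb = AloneStyleEmbedding()
print(emb(["one", "embedding", "for", "all"]).shape)  # torch.Size([4, 128])
```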
- Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies [60.285091454321055]
We design a simple and efficient embedding algorithm that learns a small set of anchor embeddings and a sparse transformation matrix.
On text classification, language modeling, and movie recommendation benchmarks, we show that ANT is particularly suitable for large vocabulary sizes.
arXiv Detail & Related papers (2020-03-18T13:07:51Z)
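A compact sketch of the factorization idea in the Anchor & Transform entry above: full-vocabulary embeddings are expressed as a sparse, non-negative combination of a small anchor set. The sizes, the ReLU parameterization, and the L1-style penalty are assumptions standing in for the paper's exact formulation.

```python
import torch
import torch.nn as nn


class AnchorTransformEmbedding(nn.Module):
    """Full-vocabulary embeddings expressed as a sparse, non-negative mix of a
    small anchor set: E = relu(T) @ A, with a penalty keeping T sparse."""

    def __init__(self, vocab_size: int = 30000, num_anchors: int = 64, dim: int = 128):
        super().__init__()
        self.anchors = nn.Parameter(torch.randn(num_anchors, dim) * 0.02)
        self.transform = nn.Parameter(torch.randn(vocab_size, num_anchors) * 0.01)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        weights = torch.relu(self.transform[token_ids])  # non-negative mixing weights
        return weights @ self.anchors                    # (..., dim)

    def sparsity_penalty(self) -> torch.Tensor:
        # Add to the training loss so most transformation entries are driven to zero.
        return torch.relu(self.transform).sum()


emb = AnchorTransformEmbedding()
token_ids = torch.randint(0, 30000, (2, 16))
print(emb(token_ids).shape)  # torch.Size([2, 16, 128])
```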
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.