Sparsifying Transformer Models with Trainable Representation Pooling
- URL: http://arxiv.org/abs/2009.05169v4
- Date: Mon, 7 Mar 2022 12:49:08 GMT
- Title: Sparsifying Transformer Models with Trainable Representation Pooling
- Authors: Michał Pietruszka, Łukasz Borchmann, Łukasz Garncarek
- Abstract summary: We propose a novel method to sparsify attention in the Transformer model by learning to select the most-informative token representations during the training process.
A robust trainable top-$k$ operator reduces the quadratic time and memory complexity to sublinear.
- Score: 5.575448433529451
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel method to sparsify attention in the Transformer model by
learning to select the most-informative token representations during the
training process, thus focusing on the task-specific parts of an input. A
reduction of quadratic time and memory complexity to sublinear was achieved due
to a robust trainable top-$k$ operator. Our experiments on a challenging long
document summarization task show that even our simple baseline performs
comparably to the current SOTA, and with trainable pooling, we can retain its
top quality, while being $1.8\times$ faster during training, $4.5\times$ faster
during inference, and up to $13\times$ more computationally efficient in the
decoder.
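To make the mechanism concrete, below is a minimal PyTorch sketch of the general idea described in the abstract: a learned scorer ranks token representations, only the top-$k$ are kept, and the scores are multiplied back onto the selected vectors so the selection remains trainable. This is an illustrative assumption on my part, not the authors' released code; the paper's actual trainable top-$k$ operator is more robust than this plain `torch.topk` variant.

```python
import torch
import torch.nn as nn

class TopKPooling(nn.Module):
    """Illustrative sketch: score tokens, keep the k highest-scoring ones,
    and weight them by their scores so gradients reach the scorer."""

    def __init__(self, d_model: int, k: int):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)  # learned per-token relevance score
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = self.scorer(x).squeeze(-1)            # (batch, seq_len)
        topk = torch.topk(scores, self.k, dim=-1)      # hard top-k (indices are non-differentiable)
        idx = topk.indices.unsqueeze(-1).expand(-1, -1, x.size(-1))
        selected = torch.gather(x, 1, idx)             # (batch, k, d_model)
        # Multiply by (soft) scores so the scorer still receives gradient signal.
        weights = torch.sigmoid(topk.values).unsqueeze(-1)
        return selected * weights

# Usage: pool a 1024-token sequence down to 128 tokens before further attention.
pooled = TopKPooling(d_model=512, k=128)(torch.randn(2, 1024, 512))
print(pooled.shape)  # torch.Size([2, 128, 512])
```

Keeping only $k$ of the $n$ token representations is what shrinks the attention cost: subsequent layers attend over $k \ll n$ vectors instead of the full sequence, which is where the reported decoder-side savings come from.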
Related papers
- FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models [35.40065954148091]
FINE is a method based on the Learngene framework for initializing downstream networks by leveraging pre-trained models.
It decomposes pre-trained knowledge into the product of matrices (i.e., $U$, $\Sigma$, and $V$), where $U$ and $V$ are shared across network blocks as "learngenes".
It consistently outperforms direct pre-training, particularly for smaller models, achieving state-of-the-art results across variable model sizes.
arXiv Detail & Related papers (2024-09-28T08:57:17Z) - Efficient Verification-Based Face Identification [50.616875565173274]
We study the problem of performing face verification with an efficient neural model $f$.
Our model leads to a substantially smaller $f$, requiring only 23k parameters and 5M floating-point operations (FLOPs).
We use six face verification datasets to demonstrate that our method is on par or better than state-of-the-art models.
arXiv Detail & Related papers (2023-12-20T18:08:02Z) - On the Effectiveness of LayerNorm Tuning for Continual Learning in Vision Transformers [47.77328392236625]
State-of-the-art rehearsal-free continual learning methods exploit the peculiarities of Vision Transformers to learn task-specific prompts.
We introduce a two-stage training procedure, where we first optimize the task-specific parameters and then train the classifier with the same selection procedure of the inference time.
Our method achieves results that are either superior or on par with the state of the art while being computationally cheaper.
arXiv Detail & Related papers (2023-08-18T15:11:16Z) - Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z) - $\Delta$-Patching: A Framework for Rapid Adaptation of Pre-trained Convolutional Networks without Base Performance Loss [71.46601663956521]
Models pre-trained on large-scale datasets are often fine-tuned to support newer tasks and datasets that arrive over time.
We propose $\Delta$-Patching for fine-tuning neural network models in an efficient manner, without the need to store model copies.
Our experiments show that $\Delta$-Networks outperform earlier model patching work while only requiring a fraction of parameters to be trained.
arXiv Detail & Related papers (2023-03-26T16:39:44Z) - Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z) - Leveraging universality of jet taggers through transfer learning [0.0]
In this article, we explore the use of transfer learning techniques to develop fast and data-efficient jet taggers.
We find that one can obtain reliable taggers using an order of magnitude less data with a corresponding speed-up of the training process.
This offers a promising avenue to facilitate the use of such tools in collider physics experiments.
arXiv Detail & Related papers (2022-03-11T19:05:26Z) - DoT: An efficient Double Transformer for NLP tasks with tables [3.0079490585515343]
DoT is a double transformer model that decomposes the problem into two sub-tasks.
We show that, for a small drop in accuracy, DoT improves training and inference time by at least 50%.
arXiv Detail & Related papers (2021-06-01T13:33:53Z) - The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from neural network computation on simple instances (see the sketch after this list).
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
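As a rough illustration of the early-exit idea referenced above, here is a minimal PyTorch sketch: one classification head per encoder layer, and inference stops as soon as the softmax confidence clears a threshold. The class name, head placement, and threshold value are illustrative assumptions, not the paper's exact calibrated procedure.

```python
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    """Hypothetical early-exit wrapper: a classifier after each layer;
    at inference, stop once the prediction is confident enough."""

    def __init__(self, layers: nn.ModuleList, d_model: int, num_classes: int,
                 threshold: float = 0.9):
        super().__init__()
        self.layers = layers
        self.heads = nn.ModuleList([nn.Linear(d_model, num_classes) for _ in layers])
        self.threshold = threshold  # confidence needed to exit early

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (1, seq_len, d_model); single-example inference for simplicity.
        for layer, head in zip(self.layers, self.heads):
            x = layer(x)
            probs = head(x[:, 0]).softmax(dim=-1)  # classify from the first token
            if probs.max() >= self.threshold:      # confident enough: exit here
                return probs
        return probs                               # fall through: use the last layer
```

Easy inputs exit after a few layers while hard ones use the full stack, which is how such methods trade accuracy for inference cost per instance.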
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.