AdapLeR: Speeding up Inference by Adaptive Length Reduction
- URL: http://arxiv.org/abs/2203.08991v1
- Date: Wed, 16 Mar 2022 23:41:38 GMT
- Title: AdapLeR: Speeding up Inference by Adaptive Length Reduction
- Authors: Ali Modarressi, Hosein Mohebbi, Mohammad Taher Pilehvar
- Abstract summary: We propose a novel approach for reducing the computational cost of BERT with minimal loss in downstream performance.
Our method dynamically eliminates less contributing tokens through layers, resulting in shorter lengths and consequently lower computational cost.
Our experiments on several diverse classification tasks show speedups up to 22x during inference time without much sacrifice in performance.
- Score: 15.57872065467772
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models have shown stellar performance in various
downstream tasks. But this usually comes at the cost of high latency and
computation, hindering their usage in resource-limited settings. In this work,
we propose a novel approach for reducing the computational cost of BERT with
minimal loss in downstream performance. Our method dynamically eliminates less
contributing tokens through layers, resulting in shorter lengths and
consequently lower computational cost. To determine the importance of each
token representation, we train a Contribution Predictor for each layer using a
gradient-based saliency method. Our experiments on several diverse
classification tasks show speedups up to 22x during inference time without much
sacrifice in performance. We also validate the quality of the selected tokens
in our method using human annotations in the ERASER benchmark. In comparison to
other widely used strategies for selecting important tokens, such as saliency
and attention, our proposed method has a significantly lower false positive
rate in generating rationales. Our code is freely available at
https://github.com/amodaresi/AdapLeR .
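As a rough illustration of the core mechanism, the sketch below scores tokens with gradient-x-input saliency and keeps only the top-scoring fraction before further processing. This is a minimal sketch assuming a HuggingFace-style classifier; the paper's actual Contribution Predictor is a small trained network supervised by such saliency signals, not this direct computation.

```python
import torch
import torch.nn.functional as F

def saliency_scores(model, input_ids, attention_mask, labels):
    # Embed tokens; detach so the embeddings are a leaf we can grad against.
    embeds = model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds, attention_mask=attention_mask).logits
    loss = F.cross_entropy(logits, labels)
    (grad,) = torch.autograd.grad(loss, embeds)
    # Gradient-x-input saliency, summed over the hidden dimension.
    return (grad * embeds).abs().sum(dim=-1)          # (batch, seq_len)

def reduce_length(input_ids, scores, keep_ratio=0.5):
    # Keep the highest-scoring tokens, preserving their original order.
    k = max(1, int(scores.size(1) * keep_ratio))
    keep = scores.topk(k, dim=1).indices.sort(dim=1).values
    return torch.gather(input_ids, 1, keep)
```

In AdapLeR the reduction is applied progressively at every layer, so the saved computation compounds with depth.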
Related papers
- Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning [18.776903525210933]
We introduce an efficient fine-tuning method for ViTs called ALaST (Adaptive Layer Selection Fine-Tuning for Vision Transformers).
Our approach is based on the observation that not all layers are equally critical during fine-tuning, and their importance varies depending on the current mini-batch.
We show that this adaptive compute allocation enables a nearly-optimal schedule for distributing computational resources.
arXiv Detail & Related papers (2024-08-16T11:27:52Z) - VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections [35.133698935322634]
Large language models (LLMs) have recently emerged as powerful tools for tackling many language-processing tasks.
We identify and characterise the important components needed for effective model convergence using gradient descent.
This result leads us to a cheap and memory-efficient algorithm for both fine-tuning and pre-training LLMs.
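A hedged sketch of what a rank-1 sub-token projection could look like (the compress/reconstruct pair and the fixed vector v are illustrative assumptions, not VeLoRA's API): each sub-token is stored as a single scalar during the forward pass and coarsely rebuilt when the backward pass needs it.

```python
import torch

def compress(x, v):
    # x: (..., d) sub-token activations; v: (d,) fixed unit projection vector.
    return x @ v                        # one scalar per sub-token

def reconstruct(c, v):
    # Coarse rank-1 reconstruction used when gradients need the activation.
    return c.unsqueeze(-1) * v
```

Storing one scalar instead of d floats per sub-token is where the memory saving comes from.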
arXiv Detail & Related papers (2024-05-28T09:23:14Z) - EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models [40.651650382105636]
The vanilla method adds padding tokens to keep the number of new tokens consistent across samples.
We propose a novel method that resolves the inconsistency in the number of tokens accepted by different samples without increasing memory or computing overhead.
In particular, it handles samples whose accepted prediction tokens differ in number without adding any padding tokens.
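A minimal sketch of the bookkeeping this implies, assuming a per-sample write offset instead of batch-wide padding (the buffer layout and names are illustrative, not EMS-SD's interface):

```python
import torch

def append_accepted(out_buf, lengths, accepted):
    # out_buf: (batch, max_len) LongTensor output buffer.
    # lengths: list[int], tokens already written per sample.
    # accepted: list of 1-D LongTensors, ragged per-sample accepted tokens.
    for i, toks in enumerate(accepted):
        n = toks.numel()
        out_buf[i, lengths[i]:lengths[i] + n] = toks
        lengths[i] += n                 # each sample advances independently
    return out_buf, lengths
```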
arXiv Detail & Related papers (2024-05-13T08:24:21Z) - Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
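The divide-and-conquer shape might look like the sketch below: split the input into chunks, prune each chunk by a proxy importance score, and merge neighbors pairwise until one chunk remains. This is an illustration under simplifying assumptions; HOMER performs the reduction and merging on hidden states inside the transformer, not on raw embeddings with a norm heuristic.

```python
import torch

def prune(chunk, keep_ratio=0.75):
    # chunk: (chunk_len, dim). Proxy importance: L2 norm per token.
    k = max(1, int(chunk.size(0) * keep_ratio))
    idx = chunk.norm(dim=-1).topk(k).indices.sort().values
    return chunk[idx]

def hierarchical_merge(x, chunk_len=512):
    # x: (seq_len, dim) token embeddings of a long input.
    chunks = list(x.split(chunk_len, dim=0))
    while len(chunks) > 1:
        chunks = [prune(c) for c in chunks]
        # Merge adjacent pairs; an odd leftover chunk passes through.
        chunks = [torch.cat(chunks[i:i + 2], dim=0)
                  for i in range(0, len(chunks), 2)]
    return chunks[0]
```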
arXiv Detail & Related papers (2024-04-16T06:34:08Z) - AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution to different regions of the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z) - Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers [29.319666323947708]
We present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness.
Our method employs a learnable mechanism that determines which uninformative tokens can be dropped from the context.
Our reference implementation achieves up to a 2x increase in inference throughput and even greater memory savings.
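A minimal sketch of one plausible form of such a mechanism, assuming a per-token linear scorer with a hard threshold at inference (the paper's actual module and its integration with sparse attention are not reproduced here):

```python
import torch
import torch.nn as nn

class PruneGate(nn.Module):
    # Learnable scorer that decides which cached context tokens to keep.
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, h, threshold=0.5):
        # h: (seq_len, dim) hidden states of the cached context.
        keep_prob = torch.sigmoid(self.scorer(h)).squeeze(-1)
        mask = keep_prob >= threshold   # hard drop at inference time
        return h[mask], mask
```

During training such a gate would typically be relaxed to a soft, differentiable mask; the hard threshold applies only at inference.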
arXiv Detail & Related papers (2023-05-25T07:39:41Z) - Revisiting Token Dropping Strategy in Efficient BERT Pretraining [102.24112230802011]
Token dropping is a strategy to speed up the pretraining of masked language models, such as BERT, by skipping the computation of a subset of the input tokens at several middle layers.
However, we empirically find that token dropping is prone to a semantic loss problem and falls short in handling semantic-intense tasks.
Motivated by this, we propose a simple yet effective semantic-consistent learning method (ScTD) to improve token dropping.
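The underlying token-dropping mechanics look roughly like the sketch below: hidden states of unimportant tokens are set aside at a middle layer and rejoin the sequence near the top. The layer indices and interface are assumptions; ScTD adds its semantic-consistency objective on top of this kind of recipe.

```python
import torch

def encode_with_token_dropping(layers, h, keep_idx, drop_at=4, restore_at=10):
    # layers: modules mapping (batch, seq, dim) -> (batch, seq, dim).
    # h: (batch, seq, dim) embeddings; keep_idx: (batch, k) kept-token indices.
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, h.size(-1))
    stash = None
    for i, layer in enumerate(layers):
        if i == drop_at:                 # drop: keep only important tokens
            stash = h
            h = torch.gather(h, 1, idx)
        elif i == restore_at:            # restore: dropped tokens rejoin,
            h = stash.scatter(1, idx, h) # carrying their stashed states
        h = layer(h)
    return h
```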
arXiv Detail & Related papers (2023-05-24T15:59:44Z) - Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks [70.75043144299168]
Spiking Neural Networks (SNNs) are promising energy-efficient models for neuromorphic computing.
We propose the Spatial Learning Through Time (SLTT) method that can achieve high performance while greatly improving training efficiency.
Our method achieves state-of-the-art accuracy on ImageNet, while the memory cost and training time are reduced by more than 70% and 50%, respectively, compared with BPTT.
arXiv Detail & Related papers (2023-02-28T05:01:01Z) - Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
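As a generic illustration of attention-guided token pruning in a ViT block (the common recipe this line of work builds on; the paper's learnable, fully attention-driven scoring is not reproduced), patch tokens can be ranked by the attention the [CLS] token pays them:

```python
import torch

def prune_by_cls_attention(tokens, attn, keep_ratio=0.7):
    # tokens: (batch, 1 + n_patches, dim), with [CLS] first.
    # attn:   (batch, heads, seq, seq) attention weights of the block.
    cls_attn = attn[:, :, 0, 1:].mean(dim=1)          # (batch, n_patches)
    k = max(1, int(cls_attn.size(1) * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices.sort(dim=1).values
    patches = torch.gather(
        tokens[:, 1:], 1,
        idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
    return torch.cat([tokens[:, :1], patches], dim=1)  # [CLS] always kept
```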
arXiv Detail & Related papers (2022-09-28T03:07:32Z) - Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
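A hedged sketch of what such a re-scaling could look like: match each predicted novel weight vector's norm to the mean norm of the pretrained base weights. Function and tensor names are illustrative, not the paper's code.

```python
import torch

def adaptive_length_rescale(novel_w, base_w):
    # novel_w: (n_novel, dim) predicted novel class weights.
    # base_w:  (n_base, dim) pretrained base class weights.
    target = base_w.norm(dim=1).mean()
    scale = target / novel_w.norm(dim=1, keepdim=True).clamp_min(1e-8)
    return novel_w * scale
```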
arXiv Detail & Related papers (2022-03-23T06:24:31Z) - SparseDet: Improving Sparsely Annotated Object Detection with Pseudo-positive Mining [76.95808270536318]
We propose an end-to-end system that learns to separate proposals into labeled and unlabeled regions using Pseudo-positive mining.
While the labeled regions are processed as usual, self-supervised learning is used to process the unlabeled regions.
We conduct exhaustive experiments on five splits on the PASCAL-VOC and COCO datasets achieving state-of-the-art performance.
arXiv Detail & Related papers (2022-01-12T18:57:04Z)