Accelerating Pre-trained Language Models via Calibrated Cascade
- URL: http://arxiv.org/abs/2012.14682v1
- Date: Tue, 29 Dec 2020 09:43:50 GMT
- Title: Accelerating Pre-trained Language Models via Calibrated Cascade
- Authors: Lei Li, Yankai Lin, Shuhuai Ren, Deli Chen, Xuancheng Ren, Peng Li,
Jie Zhou, Xu Sun
- Abstract summary: We analyze the working mechanism of dynamic early exiting and find it cannot achieve a satisfactory trade-off between inference speed and performance.
We propose CascadeBERT, which dynamically selects a proper-sized, complete model in a cascading manner.
- Score: 37.00619245086208
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dynamic early exiting aims to accelerate pre-trained language models' (PLMs)
inference by exiting at a shallow layer without passing through the entire model.
In this paper, we analyze the working mechanism of dynamic early exiting and
find it cannot achieve a satisfactory trade-off between inference speed and
performance. On one hand, the PLMs' representations in shallow layers are not
sufficient for accurate prediction. On the other hand, the internal off-ramps
cannot provide reliable exiting decisions. To remedy this, we instead propose
CascadeBERT, which dynamically selects a proper-sized, complete model in a
cascading manner. To obtain more reliable model selection, we further devise a
difficulty-aware objective, encouraging the model output class probability to
reflect the real difficulty of each instance. Extensive experimental results
demonstrate the superiority of our proposal over strong baseline models of
PLMs' acceleration including both dynamic early exiting and knowledge
distillation methods.
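To make the cascading mechanism concrete, here is a minimal sketch of confidence-thresholded two-stage cascade inference: a small but complete PLM handles every instance first, and the full-size model is consulted only when the small model's top-class probability falls below a threshold. The checkpoint names, the threshold value, and the helper function are illustrative assumptions for this sketch, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative two-stage cascade: a small complete model backed by a full-size model.
# The checkpoint names and the confidence threshold are assumptions for this sketch,
# not the configuration used in the paper.
SMALL_CKPT = "distilbert-base-uncased-finetuned-sst-2-english"
LARGE_CKPT = "textattack/bert-base-uncased-SST-2"
THRESHOLD = 0.9  # exit from the small model when its top-class probability is high enough

tok_small = AutoTokenizer.from_pretrained(SMALL_CKPT)
tok_large = AutoTokenizer.from_pretrained(LARGE_CKPT)
model_small = AutoModelForSequenceClassification.from_pretrained(SMALL_CKPT).eval()
model_large = AutoModelForSequenceClassification.from_pretrained(LARGE_CKPT).eval()

@torch.no_grad()
def cascade_predict(text: str) -> int:
    """Classify `text`, consulting the large model only for hard instances."""
    probs = F.softmax(
        model_small(**tok_small(text, return_tensors="pt")).logits, dim=-1
    )
    confidence, prediction = probs.max(dim=-1)
    if confidence.item() >= THRESHOLD:   # easy instance: the small, complete model answers
        return prediction.item()
    probs = F.softmax(                   # hard instance: fall back to the full-size model
        model_large(**tok_large(text, return_tensors="pt")).logits, dim=-1
    )
    return probs.argmax(dim=-1).item()

print(cascade_predict("A thoroughly enjoyable film."))
```

The difficulty-aware objective described in the abstract is what makes such a threshold rule workable: it calibrates the small model so that its output class probability tracks how hard each instance actually is, rather than being uniformly over-confident.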
Related papers
- Advancing the Robustness of Large Language Models through Self-Denoised Smoothing [50.54276872204319]
Large language models (LLMs) have achieved significant success, but their vulnerability to adversarial perturbations has raised considerable concerns.
We propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions.
Unlike previous denoised smoothing techniques in computer vision, which require training a separate denoising model, our method offers significantly better efficiency and flexibility.
arXiv Detail & Related papers (2024-04-18T15:47:00Z)
- Predictable MDP Abstraction for Unsupervised Model-Based RL [93.91375268580806]
We propose predictable MDP abstraction (PMA)
Instead of training a predictive model on the original MDP, we train a model on a transformed MDP with a learned action space.
We theoretically analyze PMA and empirically demonstrate that PMA leads to significant improvements over prior unsupervised model-based RL approaches.
arXiv Detail & Related papers (2023-02-08T07:37:51Z)
- Plan To Predict: Learning an Uncertainty-Foreseeing Model for Model-Based Reinforcement Learning [32.24146877835396]
We propose Plan To Predict (P2P), a framework that treats the model rollout process as a sequential decision-making problem.
We show that P2P achieves state-of-the-art performance on several challenging benchmark tasks.
arXiv Detail & Related papers (2023-01-20T10:17:22Z)
- COST-EFF: Collaborative Optimization of Spatial and Temporal Efficiency with Slenderized Multi-exit Language Models [16.586312156966635]
Transformer-based pre-trained language models (PLMs) mostly suffer from excessive overhead despite their advanced capacity.
Existing statically compressed models are unaware of the diverse complexities between input instances.
We propose a collaborative optimization for PLMs that integrates static model compression and dynamic inference acceleration.
arXiv Detail & Related papers (2022-10-27T15:06:40Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- Sample Efficient Reinforcement Learning via Model-Ensemble Exploration and Exploitation [3.728946517493471]
MEEE is a model-ensemble method that consists of optimistic exploration and weighted exploitation.
Our approach outperforms other model-free and model-based state-of-the-art methods, especially in sample complexity.
arXiv Detail & Related papers (2021-07-05T07:18:20Z)
- Dynamic Model Pruning with Feedback [64.019079257231]
We propose a novel model compression method that generates a sparse trained model without additional overhead.
We evaluate our method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models.
arXiv Detail & Related papers (2020-06-12T15:07:08Z)
- BERT Loses Patience: Fast and Robust Inference with Early Exit [91.26199404912019]
We propose Patience-based Early Exit as a plug-and-play technique to improve the efficiency and robustness of a pretrained language model.
Our approach improves inference efficiency as it allows the model to make a prediction with fewer layers.
arXiv Detail & Related papers (2020-06-07T13:38:32Z)
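For contrast with the cascade above, the patience-based early exit mechanism from the "BERT Loses Patience" entry can be sketched with a toy encoder stack: an internal classifier ("off-ramp") after every layer makes a prediction, and inference stops once the same label has been predicted for a fixed number of consecutive layers. The layer sizes and the stand-alone stack below are assumptions for illustration, not PABEE's released implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

NUM_LAYERS, HIDDEN, NUM_CLASSES, PATIENCE = 12, 64, 2, 3  # toy sizes, illustrative only

# A toy encoder stack with one internal classifier ("off-ramp") per layer.
layers = nn.ModuleList(nn.Linear(HIDDEN, HIDDEN) for _ in range(NUM_LAYERS))
off_ramps = nn.ModuleList(nn.Linear(HIDDEN, NUM_CLASSES) for _ in range(NUM_LAYERS))

@torch.no_grad()
def patience_early_exit(hidden: torch.Tensor) -> tuple[int, int]:
    """Return (predicted class, layers used) under patience-based early exiting."""
    prev_pred, streak = None, 0
    for depth, (layer, ramp) in enumerate(zip(layers, off_ramps), start=1):
        hidden = torch.tanh(layer(hidden))
        pred = ramp(hidden).argmax(dim=-1).item()
        streak = streak + 1 if pred == prev_pred else 1
        prev_pred = pred
        if streak >= PATIENCE:        # same prediction for PATIENCE layers: exit early
            return pred, depth
    return prev_pred, NUM_LAYERS      # no early exit: use the full stack

print(patience_early_exit(torch.randn(1, HIDDEN)))
```

This per-layer off-ramp decision is exactly the exiting mechanism the main abstract argues is unreliable in shallow layers, which motivates CascadeBERT's alternative of selecting among complete models instead.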