Faster Depth-Adaptive Transformers
- URL: http://arxiv.org/abs/2004.13542v4
- Date: Wed, 16 Dec 2020 09:01:38 GMT
- Title: Faster Depth-Adaptive Transformers
- Authors: Yijin Liu, Fandong Meng, Jie Zhou, Yufeng Chen, Jinan Xu
- Abstract summary: Depth-adaptive neural networks can dynamically adjust depths according to the hardness of input words.
Previous works generally build a halting unit to decide whether the computation should continue or stop at each layer.
In this paper, we get rid of the halting unit and estimate the required depths in advance, which yields a faster depth-adaptive model.
- Score: 71.20237659479703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Depth-adaptive neural networks can dynamically adjust depths according to the
hardness of input words, and thus improve efficiency. The main challenge is how
to measure such hardness and decide the required depth (i.e., the number of
layers) to use. Previous works generally build a halting unit to decide whether the
computation should continue or stop at each layer. As there is no specific
supervision of depth selection, the halting unit may be under-optimized and
inaccurate, which results in suboptimal and unstable performance when modeling
sentences. In this paper, we get rid of the halting unit and estimate the
required depths in advance, which yields a faster depth-adaptive model.
Specifically, two approaches are proposed to explicitly measure the hardness of
input words and estimate the corresponding adaptive depths, namely 1) mutual
information (MI) based estimation and 2) reconstruction loss based estimation.
We conduct experiments on the text classification task with 24 datasets in
various sizes and domains. Results confirm that our approaches can speed up the
vanilla Transformer (up to 7x) while preserving high accuracy. Moreover,
efficiency and robustness are significantly improved when compared with other
depth-adaptive approaches.
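The abstract states that each word's depth is fixed before the model runs rather than decided layer by layer. The snippet below is a minimal, hedged Python sketch of the first approach, mutual-information (MI) based estimation: word-label MI is estimated from training counts and mapped to a per-word layer count ahead of inference. The count-based estimator, the rank-to-depth mapping, and its orientation (whether low or high MI counts as "hard") are illustrative assumptions rather than the paper's exact formulation; the reconstruction-loss variant is not shown.

```python
import math
from collections import Counter, defaultdict

def word_label_mi(corpus, labels):
    """Estimate I(word; label) per word from document-level co-occurrence counts."""
    n_docs = len(corpus)
    label_count = Counter(labels)
    word_count = Counter()
    joint_count = defaultdict(Counter)          # word -> label -> #docs containing word
    for tokens, y in zip(corpus, labels):
        for w in set(tokens):                   # presence/absence, not frequency
            word_count[w] += 1
            joint_count[w][y] += 1
    mi = {}
    for w, n_w in word_count.items():
        score = 0.0
        for y, n_wy in joint_count[w].items():
            p_wy = n_wy / n_docs
            p_w, p_y = n_w / n_docs, label_count[y] / n_docs
            score += p_wy * math.log(p_wy / (p_w * p_y))
        mi[w] = score                           # positive-evidence terms only (approximation)
    return mi

def word_depth(word, mi, max_depth=6):
    """Map a word's MI score to a fixed depth in [1, max_depth] by rank.

    Whether high MI should mean "easy" (shallow) or "hard" (deep) is a modeling
    choice; the orientation below is an assumption, not the paper's."""
    scores = sorted(mi.values())
    rank = scores.index(mi.get(word, scores[0])) / max(len(scores) - 1, 1)
    return 1 + round((1.0 - rank) * (max_depth - 1))    # here: lower MI -> more layers

# Toy usage on a two-class corpus: depths are decided before inference,
# so no halting unit is consulted while the model runs.
corpus = [["good", "movie"], ["bad", "movie"], ["great", "plot"]]
labels = [1, 0, 1]
mi = word_label_mi(corpus, labels)
print({w: word_depth(w, mi) for w in mi})
```

In the actual model, the estimated depth would determine how many Transformer layers each token passes through, which is what removes the need for a per-layer halting unit.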
Related papers
- Self-supervised Monocular Depth Estimation with Large Kernel Attention [30.44895226042849]
We propose a self-supervised monocular depth estimation network to obtain finer details.
Specifically, we propose a decoder based on large kernel attention, which can model long-distance dependencies.
Our method achieves competitive results on the KITTI dataset.
arXiv Detail & Related papers (2024-09-26T14:44:41Z)
- Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN [9.185929396989083]
We employ a sparse pixel approach to contrastively analyze the distinctions between Transformers and CNNs.
Our findings suggest that while Transformers excel in handling global context and intricate textures, they lag behind CNNs in preserving depth gradient continuity.
We propose the Depth Gradient Refinement (DGR) module that refines depth estimation through high-order differentiation, feature fusion, and recalibration.
arXiv Detail & Related papers (2023-08-16T12:46:52Z)
- Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs [75.40636935415601]
Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs.
We take an incremental computing approach, looking to reuse calculations as the inputs change.
We apply this approach to the transformers architecture, creating an efficient incremental inference algorithm with complexity proportional to the fraction of modified inputs.
arXiv Detail & Related papers (2023-07-27T16:30:27Z)
- DDPG-Driven Deep-Unfolding with Adaptive Depth for Channel Estimation with Sparse Bayesian Learning [23.158142411929322]
We first develop a framework of deep deterministic policy gradient (DDPG)-driven deep-unfolding with adaptive depth for different inputs.
Specifically, the framework is employed to deal with the channel estimation problem in massive multiple-input multiple-output systems.
arXiv Detail & Related papers (2022-01-20T22:35:42Z)
- Latency Adjustable Transformer Encoder for Language Understanding [0.8287206589886879]
This paper proposes an efficient Transformer architecture that adjusts the inference computational cost adaptively with a desired inference latency speedup.
The proposed method detects less important hidden sequence elements (word-vectors) and eliminates them in each encoder layer using a proposed Attention Context Contribution (ACC) metric.
The proposed method mathematically and experimentally improves the inference latency of BERT_base and GPT-2 by up to 4.8x and 3.72x, respectively, with less than a 0.75% accuracy drop and acceptable perplexity on average.
arXiv Detail & Related papers (2022-01-10T13:04:39Z)
- Geometry Uncertainty Projection Network for Monocular 3D Object Detection [138.24798140338095]
We propose a Geometry Uncertainty Projection Network (GUP Net) to tackle the error amplification problem at both inference and training stages.
Specifically, a GUP module is proposed to obtain the geometry-guided uncertainty of the inferred depth.
At the training stage, we propose a Hierarchical Task Learning strategy to reduce the instability caused by error amplification.
arXiv Detail & Related papers (2021-07-29T06:59:07Z)
- An Adaptive Framework for Learning Unsupervised Depth Completion [59.17364202590475]
We present a method to infer a dense depth map from a color image and associated sparse depth measurements.
We show that regularization and co-visibility are related via the fitness of the model to data and can be unified into a single framework.
arXiv Detail & Related papers (2021-06-06T02:27:55Z)
- Direct Depth Learning Network for Stereo Matching [79.3665881702387]
A novel Direct Depth Learning Network (DDL-Net) is designed for stereo matching.
DDL-Net consists of two stages: the Coarse Depth Estimation stage and the Adaptive-Grained Depth Refinement stage.
We show that DDL-Net achieves an average improvement of 25% on the SceneFlow dataset and 12% on the DrivingStereo dataset.
arXiv Detail & Related papers (2020-12-10T10:33:57Z)
- Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search [84.94597821711808]
We extend PoWER-BERT (Goyal et al., 2020) and propose a Length-Adaptive Transformer that can be used for various inference scenarios after one-shot training.
We conduct a multi-objective evolutionary search to find a length configuration that maximizes the accuracy and minimizes the efficiency metric under any given computational budget.
We empirically verify the utility of the proposed approach by demonstrating the superior accuracy-efficiency trade-off under various setups. (A toy sketch of the budget-constrained search appears after this list.)
arXiv Detail & Related papers (2020-10-14T12:28:08Z)
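As a closing illustration for the Length-Adaptive Transformer entry above, here is a minimal, hedged Python sketch of searching for a per-layer length configuration that maximizes accuracy under a computational budget. The accuracy and FLOPs proxies and the hill-climbing loop are invented stand-ins; the paper itself trains with LengthDrop and runs a multi-objective evolutionary search, neither of which is reproduced here.

```python
import random

N_LAYERS, SEQ_LEN = 6, 128

def flops_proxy(lengths):
    # Toy cost model: self-attention cost grows quadratically with kept tokens.
    return sum(l * l for l in lengths)

def accuracy_proxy(lengths):
    # Toy quality model: accuracy degrades as more tokens are dropped.
    return 1.0 - 0.5 * sum((SEQ_LEN - l) / SEQ_LEN for l in lengths) / N_LAYERS

def mutate(lengths):
    # Perturb the number of tokens kept at one layer, then re-enforce that
    # later layers never keep more tokens than earlier ones.
    i = random.randrange(N_LAYERS)
    new = list(lengths)
    new[i] = min(SEQ_LEN, max(1, new[i] + random.randint(-16, 16)))
    for j in range(1, N_LAYERS):
        new[j] = min(new[j], new[j - 1])
    return new

def search(budget, steps=500, seed=0):
    # Simple hill climb standing in for the paper's evolutionary search.
    random.seed(seed)
    best = [1] * N_LAYERS                       # trivially within budget
    for _ in range(steps):
        cand = mutate(best)
        if flops_proxy(cand) <= budget and accuracy_proxy(cand) > accuracy_proxy(best):
            best = cand
    return best

print(search(budget=50_000))                    # one kept-token count per encoder layer
```

A real search would evaluate a trained Length-Adaptive Transformer on validation data instead of these proxies.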
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.