ERNIE-SPARSE: Learning Hierarchical Efficient Transformer Through
Regularized Self-Attention
- URL: http://arxiv.org/abs/2203.12276v1
- Date: Wed, 23 Mar 2022 08:47:01 GMT
- Title: ERNIE-SPARSE: Learning Hierarchical Efficient Transformer Through
Regularized Self-Attention
- Authors: Yang Liu, Jiaxiang Liu, Li Chen, Yuxiang Lu, Shikun Feng, Zhida Feng,
Yu Sun, Hao Tian, Hua Wu, Haifeng Wang
- Abstract summary: Two factors, information bottleneck sensitivity and inconsistency between different attention topologies, could affect the performance of the Sparse Transformer.
This paper proposes a well-designed model named ERNIE-Sparse.
It consists of two distinctive parts: (i) a Hierarchical Sparse Transformer (HST) to sequentially unify local and global information, and (ii) Self-Attention Regularization (SAR) to minimize the distance between transformers with different attention topologies.
- Score: 48.697458429460184
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Sparse Transformer has recently attracted a lot of attention because of its ability to reduce the quadratic dependency on sequence length. We argue
that two factors, information bottleneck sensitivity and inconsistency between
different attention topologies, could affect the performance of the Sparse
Transformer. This paper proposes a well-designed model named ERNIE-Sparse. It
consists of two distinctive parts: (i) a Hierarchical Sparse Transformer (HST) that sequentially unifies local and global information, and (ii) a Self-Attention Regularization (SAR) method, a novel regularization designed to minimize the distance between transformers with different attention topologies. To evaluate the
effectiveness of ERNIE-Sparse, we perform extensive evaluations. Firstly, we
perform experiments on a multi-modal long sequence modeling task benchmark,
Long Range Arena (LRA). Experimental results demonstrate that ERNIE-Sparse
significantly outperforms a variety of strong baseline methods, including dense attention and other efficient sparse attention methods, achieving an improvement of 2.77% (57.78% vs. 55.01%). Secondly, to further show the effectiveness of our method, we pretrain ERNIE-Sparse and verify it on 3 text classification and 2 QA downstream tasks, achieving improvements of 0.83% on the classification benchmark (92.46% vs. 91.63%) and 3.24% on the QA benchmark (74.67% vs. 71.43%). Experimental results continue to demonstrate its superior performance.
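To make these two ideas concrete, below is a minimal sketch (not the authors' implementation) of a local-plus-global sparse attention mask and a SAR-style regularizer that penalizes the divergence between attention distributions under dense and sparse topologies. The function names, window size, and the choice of KL divergence are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F

def local_global_mask(seq_len: int, window: int, num_global: int) -> torch.Tensor:
    """Hypothetical sparse topology: a sliding local window plus a few global tokens."""
    idx = torch.arange(seq_len)
    local = (idx[None, :] - idx[:, None]).abs() <= window          # local band
    glob = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    glob[:num_global, :] = True                                    # global tokens attend everywhere
    glob[:, :num_global] = True                                    # every token attends to global tokens
    return local | glob

def attention_probs(q, k, mask=None):
    """Scaled dot-product attention weights; masked positions get zero probability."""
    scores = (q @ k.transpose(-2, -1)) / q.size(-1) ** 0.5
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1)

def sar_loss(q, k, sparse_mask):
    """SAR-style term (assumed form): KL divergence between dense and sparse attention."""
    p_dense = attention_probs(q, k)                  # full O(n^2) topology
    p_sparse = attention_probs(q, k, sparse_mask)    # sparse topology
    return F.kl_div(p_sparse.clamp_min(1e-9).log(), p_dense, reduction="batchmean")

# Toy usage: 16 tokens, hidden size 64, window of 2, 2 global tokens.
q, k = torch.randn(16, 64), torch.randn(16, 64)
mask = local_global_mask(seq_len=16, window=2, num_global=2)
print(sar_loss(q, k, mask))
```

In a training loop, a term like this would typically be added to the task loss with a weighting coefficient, encouraging the sparse-attention model to stay close to the behavior of dense attention.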
Related papers
- Co-Fix3D: Enhancing 3D Object Detection with Collaborative Refinement [33.773644087620745]
Co-Fix3D employs a collaborative hybrid multi-stage parallel query generation mechanism for BEV representations.
Our method incorporates the Local-Global Feature Enhancement (LGE) module, which refines BEV features to more effectively highlight weak positive samples.
Co-Fix3D achieves superior results on the stringent nuScenes benchmark.
arXiv Detail & Related papers (2024-08-15T07:56:02Z)
- KAN-RCBEVDepth: A multi-modal fusion algorithm in object detection for autonomous driving [2.382388777981433]
This paper introduces the KAN-RCBEVDepth method to enhance 3D object detection in autonomous driving.
Our unique Bird's Eye View-based approach significantly improves detection accuracy and efficiency.
The code will be released at https://www.laitiamo.com/laitiamo/RCBEVDepth-KAN.
arXiv Detail & Related papers (2024-08-04T16:54:49Z)
- SegStitch: Multidimensional Transformer for Robust and Efficient Medical Imaging Segmentation [15.811141677039224]
State-of-the-art methods, particularly those utilizing transformers, have been prominently adopted in 3D semantic segmentation.
However, plain vision transformers encounter challenges due to their neglect of local features and their high computational complexity.
We propose SegStitch, an innovative architecture that integrates transformers with denoising ODE blocks.
arXiv Detail & Related papers (2024-08-01T12:05:02Z)
- MLAE: Masked LoRA Experts for Visual Parameter-Efficient Fine-Tuning [45.93128932828256]
Masked LoRA Experts (MLAE) is an innovative approach that applies the concept of masking to visual PEFT.
Our method incorporates a cellular decomposition strategy that transforms a low-rank matrix into independent rank-1 submatrices.
We show that MLAE achieves new state-of-the-art (SOTA) performance with an average accuracy score of 78.8% on the VTAB-1k benchmark and 90.9% on the FGVC benchmark.
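As a hedged illustration of that cellular decomposition idea (not MLAE's actual code; the mask variable and shapes below are assumptions), a rank-r LoRA update B @ A can be rewritten as a sum of r independent rank-1 submatrices, each of which can then be masked on its own:

```python
import torch

r, d_in, d_out = 4, 32, 32
A = torch.randn(r, d_in)                    # LoRA "down" projection
B = torch.randn(d_out, r)                   # LoRA "up" projection
mask = torch.tensor([1.0, 0.0, 1.0, 1.0])   # hypothetical binary mask over rank-1 experts

# The full low-rank update equals the sum of its rank-1 submatrices ...
delta_full = B @ A
delta_parts = sum(torch.outer(B[:, i], A[i]) for i in range(r))
assert torch.allclose(delta_full, delta_parts, atol=1e-5)

# ... so each rank-1 "expert" can be kept or dropped independently.
delta_masked = sum(mask[i] * torch.outer(B[:, i], A[i]) for i in range(r))
```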
arXiv Detail & Related papers (2024-05-29T08:57:23Z)
- Fine-tuning Strategies for Faster Inference using Speech Self-Supervised Models: A Comparative Study [25.58608455210458]
Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings.
This article explores different approaches that may be deployed during the fine-tuning to reduce the computations needed in the SSL encoder.
arXiv Detail & Related papers (2023-03-12T19:52:34Z)
- UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks and efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
- FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization [73.41395947275473]
We propose a novel frequency-aware architecture, in which the domain-specific features are filtered out in the transformed frequency domain.
Experiments on three benchmarks demonstrate significant performance gains, outperforming state-of-the-art methods by margins of 3%, 4%, and 9%, respectively.
arXiv Detail & Related papers (2022-03-24T07:26:29Z)
- Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no additional computational cost.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
arXiv Detail & Related papers (2022-03-23T06:24:31Z)
- ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning [91.13797346047984]
We introduce ADAHESSIAN, a second order optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the Hessian.
We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods.
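As a hedged sketch of how such curvature estimates can be obtained, the snippet below uses a standard Hutchinson-style diagonal-Hessian estimator; it is not the official ADAHESSIAN optimizer, and the function name and sampling count are illustrative assumptions.

```python
import torch

def hutchinson_diag_hessian(loss, params, n_samples=1):
    """Estimate diag(H) via E[z * (H z)] with Rademacher vectors z (illustrative sketch)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        zs = [torch.randint_like(p, 0, 2) * 2.0 - 1.0 for p in params]   # entries in {-1, +1}
        hvps = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
        for d, z, hvp in zip(diag, zs, hvps):
            d.add_(z * hvp / n_samples)
    return diag

# Toy usage on a scalar loss of a small linear model.
w = torch.randn(5, requires_grad=True)
x, y = torch.randn(8, 5), torch.randn(8)
loss = ((x @ w - y) ** 2).mean()
print(hutchinson_diag_hessian(loss, [w]))
```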
arXiv Detail & Related papers (2020-06-01T05:00:51Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on the RNN-Transducer together with improved beam search, reaches a quality only 3.8% WER (absolute) worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.