Breaking BERT: Evaluating and Optimizing Sparsified Attention
- URL: http://arxiv.org/abs/2210.03841v1
- Date: Fri, 7 Oct 2022 22:32:27 GMT
- Title: Breaking BERT: Evaluating and Optimizing Sparsified Attention
- Authors: Siddhartha Brahma, Polina Zablotskaia, David Mimno
- Abstract summary: We evaluate the impact of sparsification patterns with a series of ablation experiments.
We find that even using attention that is at least 78% sparse can have little effect on performance if applied at later transformer layers.
- Score: 13.529939025511242
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers allow attention between all pairs of tokens, but there is
reason to believe that most of these connections - and their quadratic time and
memory cost - may not be necessary. But which ones? We evaluate the impact of
sparsification patterns with a series of ablation experiments. First, we
compare masks based on syntax, lexical similarity, and token position to random
connections, and measure which patterns reduce performance the least. We find
that on three common finetuning tasks even using attention that is at least 78%
sparse can have little effect on performance if applied at later transformer
layers, but that applying sparsity throughout the network reduces performance
significantly. Second, we vary the degree of sparsity for three patterns
supported by previous work, and find that connections to neighboring tokens
are the most significant. Finally, we treat sparsity as an optimizable
parameter, and present an algorithm to learn degrees of neighboring connections
that gives fine-grained control over the accuracy-sparsity trade-off while
approaching the performance of existing methods.
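A minimal PyTorch sketch of the position-based pattern above, assuming sparsity is imposed as a binary band mask over the attention logits; the function names and the fixed window are illustrative, not the authors' implementation:

```python
# Illustrative sketch (not the paper's released code): band-sparse attention
# that restricts each query token to a window of neighboring tokens, one of
# the position-based patterns the ablations evaluate.
import torch
import torch.nn.functional as F

def neighbor_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where |i - j| <= window, i.e. key j is a neighbor of query i."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window

def band_sparse_attention(q, k, v, window: int):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    mask = neighbor_mask(q.size(-2), window).to(scores.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Example: window=13 on a 128-token sequence keeps at most 27 of 128
# connections per query, i.e. roughly 79% sparsity, in line with the
# "at least 78% sparse" setting; per the ablations, applying this only
# at later layers costs little performance.
```

The third experiment treats the degree of neighboring connections as a learnable parameter; one common relaxation (not necessarily the paper's) replaces the hard band with a differentiable decay over |i - j| and thresholds it after training.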
Related papers
- Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification [6.660834045805309]
Pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism.
We propose integrating two strategies: token pruning and token combining (a generic pruning sketch appears after this list).
Experiments with various datasets demonstrate superior performance compared to baseline models.
arXiv Detail & Related papers (2024-06-03T12:51:52Z)
- AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z)
- Revisiting Token Pruning for Object Detection and Instance Segmentation [25.3324628669201]
We investigate token pruning to accelerate inference for object detection and instance segmentation.
We reduce the performance drop from 1.5 mAP to 0.3 mAP on both boxes and masks, compared to existing token pruning methods.
arXiv Detail & Related papers (2023-06-12T11:55:33Z)
- Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers [34.19166698049552]
Vision Transformers (ViTs) have shown competitive performance compared to convolutional neural networks (CNNs).
We propose a novel approach to learning instance-dependent attention patterns by devising a lightweight connectivity predictor module.
We show that our method reduces the FLOPs of MHSA by 48% to 69% while keeping the accuracy drop within 0.4%.
arXiv Detail & Related papers (2023-03-24T02:12:28Z)
- UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
- Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) can greatly improve the performance of convolutional neural networks (CNNs).
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token.
arXiv Detail & Related papers (2022-10-12T17:54:32Z)
- Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention [36.90363317158731]
We propose an adaptive sparse token pruning framework with minimal cost.
Our method improves the throughput of DeiT-S by 50% and brings only 0.2% drop in top-1 accuracy.
arXiv Detail & Related papers (2022-09-28T03:07:32Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We adopt transformers in this work and incorporate them into a hierarchical framework for shape classification and for part and scene segmentation.
We also compute efficient and dynamic global cross-attention by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art mean accuracy on shape classification and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation [68.45737688496654]
We establish correspondences directly between frames without re-encoding the mask features for every object.
With the correspondences, every node in the current query frame is inferred by aggregating features from the past in an associative fashion.
We validated that every memory node now has a chance to contribute, and experimentally showed that such diversified voting is beneficial to both memory efficiency and inference accuracy.
arXiv Detail & Related papers (2021-06-09T16:50:57Z)
- Unveiling Anomalous Edges and Nominal Connectivity of Attributed Networks [53.56901624204265]
The present work deals with uncovering anomalous edges in attributed graphs using two distinct formulations with complementary strengths.
The first relies on decomposing the graph data matrix into low-rank plus sparse components to markedly improve performance.
The second broadens the scope of the first by performing robust recovery of the unperturbed graph, which enhances the anomaly identification performance.
arXiv Detail & Related papers (2021-04-17T20:00:40Z)
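Several entries above (Focus on the Core, Revisiting Token Pruning, Adaptive Sparse ViT) share a token-pruning step that drops low-importance tokens between layers. A minimal sketch of that shared mechanism, with illustrative names and an attention-received importance proxy that matches no single paper exactly:

```python
# Generic attention-based token pruning: keep the tokens that receive the
# most attention. Names and the keep-ratio heuristic are illustrative.
import torch

def prune_tokens(tokens: torch.Tensor, attn: torch.Tensor, keep_ratio: float):
    """tokens: (batch, seq_len, dim); attn: (batch, heads, seq_len, seq_len)."""
    # Importance of each token = attention it receives, averaged over
    # heads and over query positions (a common proxy in this literature).
    importance = attn.mean(dim=1).mean(dim=1)   # (batch, seq_len)
    k = max(1, int(tokens.size(1) * keep_ratio))
    keep = importance.topk(k, dim=-1).indices.sort(dim=-1).values
    idx = keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return tokens.gather(1, idx)                # (batch, k, dim)
```

The methods above differ mainly in what happens to the dropped tokens: some discard them outright, while others (e.g., the token-combining strategy) merge them into the kept tokens to preserve information.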