Pyramid-BERT: Reducing Complexity via Successive Core-set based Token
Selection
- URL: http://arxiv.org/abs/2203.14380v1
- Date: Sun, 27 Mar 2022 19:52:01 GMT
- Title: Pyramid-BERT: Reducing Complexity via Successive Core-set based Token
Selection
- Authors: Xin Huang, Ashish Khetan, Rene Bidart, Zohar Karnin
- Abstract summary: Transformer-based language models such as BERT have achieved the state-of-the-art on various NLP tasks, but are computationally prohibitive.
We present Pyramid-BERT, where we replace previously used heuristics with a core-set based token selection method justified by theoretical results.
The core-set based token selection technique allows us to avoid expensive pre-training, gives a space-efficient fine tuning, and thus makes it suitable to handle longer sequence lengths.
- Score: 23.39962989492527
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Transformer-based language models such as BERT have achieved the
state-of-the-art performance on various NLP tasks, but are computationally
prohibitive. A recent line of works use various heuristics to successively
shorten sequence length while transforming tokens through encoders, in tasks
such as classification and ranking that require a single token embedding for
prediction. We present a novel solution to this problem, called Pyramid-BERT
where we replace previously used heuristics with a core-set based token
selection method justified by theoretical results. The core-set based token
selection technique allows us to avoid expensive pre-training, gives a
space-efficient fine tuning, and thus makes it suitable to handle longer
sequence lengths. We provide extensive experiments establishing advantages of
Pyramid-BERT over several baselines and existing works on the GLUE benchmarks
and Long Range Arena datasets.
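The abstract does not spell out how the core-set is built at each encoder layer. As an illustration only, the sketch below uses greedy k-center (farthest-point) selection over token embeddings, a standard core-set construction; the function name, the Euclidean distance, the fixed budget k=64, and always keeping index 0 (the [CLS] token) are assumptions for the example, not the authors' exact method.

```python
import numpy as np

def greedy_k_center(embeddings: np.ndarray, k: int) -> np.ndarray:
    """Pick k token indices whose embeddings form an approximate core-set via
    greedy k-center (farthest-point) selection. Index 0 (e.g. the [CLS] token)
    is always kept."""
    n = embeddings.shape[0]
    selected = [0]
    # distance of every token to its nearest already-selected token
    dists = np.linalg.norm(embeddings - embeddings[0], axis=-1)
    while len(selected) < min(k, n):
        nxt = int(np.argmax(dists))          # farthest remaining token
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=-1))
    return np.sort(np.array(selected))

# Hypothetical pyramid schedule: shrink the sequence after selected encoder layers.
hidden_states = np.random.randn(128, 768)    # (seq_len, hidden_dim) output of some layer
kept = greedy_k_center(hidden_states, k=64)
pruned = hidden_states[kept]                 # shorter sequence fed to the next layer
```

In a pyramid schedule, a selection step of this kind would be applied after successive encoder layers with a decreasing token budget, so only the retained rows are passed onward.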
Related papers
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging step, ensuring memory efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z) - GEC-DePenD: Non-Autoregressive Grammatical Error Correction with
Decoupled Permutation and Decoding [52.14832976759585]
Grammatical error correction (GEC) is an important NLP task that is usually solved with autoregressive sequence-to-sequence models.
We propose a novel non-autoregressive approach to GEC that decouples the architecture into a permutation network and a decoder network.
We show that the resulting network improves over previously known non-autoregressive methods for GEC.
arXiv Detail & Related papers (2023-11-14T14:24:36Z) - Breaking the Token Barrier: Chunking and Convolution for Efficient Long
Text Classification with BERT [0.0]
Transformer-based models, specifically BERT, have propelled research in various NLP tasks.
BERT models are limited to a maximum sequence length of 512 tokens, which makes it non-trivial to apply them in practical settings with long inputs.
We propose a relatively simple extension to the vanilla BERT architecture, called ChunkBERT, that allows fine-tuning of any pretrained BERT model to perform inference on arbitrarily long text.
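As a rough illustration of the chunking idea described above (not the paper's exact ChunkBERT architecture), the sketch below splits a long input into BERT-sized windows, encodes each window with an off-the-shelf pretrained model, and pools the per-chunk [CLS] vectors; mean pooling stands in here for the convolution over chunk representations named in the title.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

@torch.no_grad()
def encode_long_text(text: str, chunk_len: int = 510) -> torch.Tensor:
    """Encode a long document by splitting it into BERT-sized chunks and
    pooling the per-chunk [CLS] vectors (mean pooling as a simple stand-in)."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    cls_vectors = []
    for start in range(0, len(ids), chunk_len):
        # re-add the special tokens so each chunk looks like a normal BERT input
        chunk = [tokenizer.cls_token_id] + ids[start:start + chunk_len] + [tokenizer.sep_token_id]
        out = model(input_ids=torch.tensor([chunk]))
        cls_vectors.append(out.last_hidden_state[0, 0])   # [CLS] embedding of this chunk
    return torch.stack(cls_vectors).mean(dim=0)           # one vector for the whole document
```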
arXiv Detail & Related papers (2023-10-31T15:41:08Z) - Efficient Long Sequence Encoding via Synchronization [29.075962393432857]
We propose a synchronization mechanism for hierarchical encoding.
Our approach first identifies anchor tokens across segments and groups them by their roles in the original input sequence.
Our approach is able to improve the global information exchange among segments while maintaining efficiency.
arXiv Detail & Related papers (2022-03-15T04:37:02Z) - Hierarchical Neural Network Approaches for Long Document Classification [3.6700088931938835]
We employ pre-trained Universal Sentence Encoder (USE) and Bidirectional Encoder Representations from Transformers (BERT) models in a hierarchical setup to capture better representations efficiently.
Our proposed models are conceptually simple: we divide the input data into chunks and then pass them through the base BERT and USE models.
We show that USE + CNN/LSTM performs better than its stand-alone baseline, whereas BERT + CNN/LSTM performs on par with its stand-alone counterpart.
arXiv Detail & Related papers (2022-01-18T07:17:40Z) - Accelerating BERT Inference for Sequence Labeling via Early-Exit [65.7292767360083]
We extend the recent successful early-exit mechanism to accelerate the inference of pre-trained models (PTMs) for sequence labeling tasks.
We also propose a token-level early-exit mechanism that allows part of the tokens to exit early at different layers.
Our approach can save up to 66%-75% of the inference cost with minimal performance degradation.
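To make the token-level mechanism concrete, here is a simplified sketch of confidence-based per-token exiting. It runs every layer over the full sequence for clarity, whereas the actual savings come from no longer computing exited tokens; the layer and classifier interfaces are placeholders, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def token_level_early_exit(hidden, layers, classifiers, threshold=0.9):
    """Simplified token-level early exit for sequence labeling.

    hidden:      (seq_len, dim) token states entering the first encoder layer
    layers:      list of callables mapping (seq_len, dim) -> (seq_len, dim)
    classifiers: one label head per layer (internal exit classifiers)
    threshold:   max-probability confidence a token needs in order to exit
    """
    seq_len = hidden.size(0)
    labels = torch.full((seq_len,), -1, dtype=torch.long)
    active = torch.ones(seq_len, dtype=torch.bool)
    pred = labels.clone()
    for layer, head in zip(layers, classifiers):
        hidden = layer(hidden)                      # full-sequence pass (simplification)
        probs = F.softmax(head(hidden), dim=-1)     # (seq_len, num_labels)
        conf, pred = probs.max(dim=-1)
        exiting = active & (conf >= threshold)      # confident tokens exit at this layer
        labels[exiting] = pred[exiting]
        active &= ~exiting
        if not active.any():
            break
    labels[active] = pred[active]                   # leftovers take the last layer's prediction
    return labels
```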
arXiv Detail & Related papers (2021-05-28T14:39:26Z) - TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference [54.791572981834435]
Existing pre-trained language models (PLMs) are often computationally expensive in inference.
We propose a dynamic token reduction approach to accelerate PLMs' inference, named TR-BERT.
TR-BERT formulates the token reduction process as a multi-step token selection problem and automatically learns the selection strategy via reinforcement learning.
arXiv Detail & Related papers (2021-05-25T02:28:51Z) - Incorporating BERT into Parallel Sequence Decoding with Adapters [82.65608966202396]
We propose to take two different BERT models as the encoder and decoder respectively, and fine-tune them by introducing simple and lightweight adapter modules.
We obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models.
Our framework is based on a parallel sequence decoding algorithm named Mask-Predict, considering the bidirectional and conditionally independent nature of BERT.
arXiv Detail & Related papers (2020-10-13T03:25:15Z) - Conformer-Kernel with Query Term Independence for Document Retrieval [32.36908635150144]
The Transformer-Kernel (TK) model has demonstrated strong reranking performance on the TREC Deep Learning benchmark.
We extend the TK architecture to the full retrieval setting by incorporating the query term independence assumption.
We show that the Conformer's GPU memory requirement scales linearly with input sequence length, making it a more viable option when ranking long documents.
arXiv Detail & Related papers (2020-07-20T19:47:28Z) - ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
arXiv Detail & Related papers (2020-03-23T21:17:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.