An Exploration of Hierarchical Attention Transformers for Efficient Long
Document Classification
- URL: http://arxiv.org/abs/2210.05529v1
- Date: Tue, 11 Oct 2022 15:17:56 GMT
- Title: An Exploration of Hierarchical Attention Transformers for Efficient Long
Document Classification
- Authors: Ilias Chalkidis, Xiang Dai, Manos Fergadiotis, Prodromos Malakasiotis,
Desmond Elliott
- Abstract summary: Non-hierarchical sparse attention Transformer-based models, such as Longformer and Big Bird, are popular approaches to working with long documents.
We develop and release fully pre-trained HAT models that use segment-wise followed by cross-segment encoders.
Our best HAT model outperforms equally-sized Longformer models while using 10-20% less GPU memory and processing documents 40-45% faster.
- Score: 37.069127262896764
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Non-hierarchical sparse attention Transformer-based models, such as
Longformer and Big Bird, are popular approaches to working with long documents.
There are clear benefits to these approaches compared to the original
Transformer in terms of efficiency, but Hierarchical Attention Transformer
(HAT) models are a vastly understudied alternative. We develop and release
fully pre-trained HAT models that use segment-wise followed by cross-segment
encoders and compare them with Longformer models and partially pre-trained
HATs. In several long document downstream classification tasks, our best HAT
model outperforms equally-sized Longformer models while using 10-20% less GPU
memory and processing documents 40-45% faster. In a series of ablation studies,
we find that HATs perform better with cross-segment contextualization throughout
the model than with alternative configurations that implement either early or late
cross-segment contextualization. Our code is on GitHub:
https://github.com/coastalcph/hierarchical-transformers.
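The abstract describes HATs as stacking segment-wise encoders with cross-segment encoders. As a rough illustration of that layout only, here is a minimal PyTorch sketch of one interleaved HAT block; the hidden size, number of heads, segment length, and the use of a segment-level [CLS]-style token are assumptions for the example, not the released configuration (see the linked GitHub repository for the actual models).

```python
# Minimal sketch of a Hierarchical Attention Transformer (HAT) block.
# Illustrative only; the authors' released implementation lives at
# https://github.com/coastalcph/hierarchical-transformers.
import torch
import torch.nn as nn


class HATBlock(nn.Module):
    """One interleaved HAT layer: a segment-wise encoder over the tokens
    inside each segment, followed by a cross-segment encoder over the
    segments' [CLS]-like representations (hypothetical layer sizes)."""

    def __init__(self, hidden=256, heads=4):
        super().__init__()
        self.segment_encoder = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)
        self.cross_segment_encoder = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True)

    def forward(self, x):
        # x: (batch, n_segments, seg_len, hidden); token 0 of each segment
        # plays the role of a segment-level [CLS] token in this sketch.
        b, n, k, h = x.shape
        # 1) Segment-wise attention: tokens attend only within their segment.
        x = self.segment_encoder(x.reshape(b * n, k, h)).reshape(b, n, k, h)
        # 2) Cross-segment attention: segment [CLS] tokens attend to each other.
        cls = self.cross_segment_encoder(x[:, :, 0, :])  # (b, n, h)
        # 3) Write the contextualized [CLS] back into its segment.
        x = torch.cat([cls.unsqueeze(2), x[:, :, 1:, :]], dim=2)
        return x


if __name__ == "__main__":
    # Stacking the block roughly corresponds to the "throughout" (interleaved)
    # cross-segment contextualization the paper finds best; applying the
    # cross-segment step only after all segment-wise layers would be the
    # "late" variant from the ablations.
    model = nn.Sequential(*[HATBlock() for _ in range(4)])
    tokens = torch.randn(2, 8, 128, 256)  # 8 segments of 128 tokens each
    print(model(tokens).shape)            # torch.Size([2, 8, 128, 256])
```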
Related papers
- Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z)
- LOCOST: State-Space Models for Long Document Abstractive Summarization [76.31514220737272]
We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs.
With a computational complexity of $O(L \log L)$, this architecture can handle significantly longer sequences than state-of-the-art models based on sparse attention patterns.
arXiv Detail & Related papers (2024-01-31T15:33:37Z)
- Long-Range Transformer Architectures for Document Understanding [1.9331361036118608]
Document Understanding (DU) has not been left behind, with the first Transformer-based models for DU dating from late 2019.
We introduce two new multi-modal (text + layout) long-range models for DU based on efficient implementations of Transformers for long sequences.
Relative 2D attention proved effective on dense text for both normal and long-range models.
arXiv Detail & Related papers (2023-09-11T14:45:24Z)
- Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator [24.690247474891958]
The Fourier Transformer significantly reduces computational costs while retaining the ability to inherit from various large pretrained models.
Our model achieves state-of-the-art performance among all transformer-based models on the long-range modeling benchmark LRA.
For generative seq-to-seq tasks, including CNN/DailyMail and ELI5, our model outperforms the standard BART by inheriting the BART weights.
arXiv Detail & Related papers (2023-05-24T12:33:06Z)
- Transformer-based Models for Long-Form Document Matching: Challenges and Empirical Analysis [12.269318291685753]
We show that simple neural models outperform the more complex BERT-based models.
Simple models are also more robust to variations in document length and text perturbations.
arXiv Detail & Related papers (2023-02-07T21:51:05Z)
- Revisiting Transformer-based Models for Long Document Classification [31.60414185940218]
In real-world applications, multi-page multi-paragraph documents are common and cannot be efficiently encoded by vanilla Transformer-based models.
We compare different Transformer-based Long Document Classification (TrLDC) approaches that aim to mitigate the computational overhead of vanilla transformers.
We observe a clear benefit from being able to process longer text and, based on our results, derive practical advice on applying Transformer-based models to long document classification tasks.
arXiv Detail & Related papers (2022-04-14T00:44:36Z)
- HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attention for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in ROUGE F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z)
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
- Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long Range Arena benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.