Long-Range Transformer Architectures for Document Understanding
- URL: http://arxiv.org/abs/2309.05503v1
- Date: Mon, 11 Sep 2023 14:45:24 GMT
- Title: Long-Range Transformer Architectures for Document Understanding
- Authors: Thibault Douzon, Stefan Duffner, Christophe Garcia and Jérémy Espinas
- Abstract summary: Document Understanding (DU) was not left behind, with the first Transformer-based models for DU dating from late 2019.
We introduce two new multi-modal (text + layout) long-range models for DU based on efficient implementations of Transformers for long sequences.
Relative 2D attention proved effective on dense text for both normal and long-range models.
- Score: 1.9331361036118608
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since their release, Transformers have revolutionized many fields, from
Natural Language Understanding to Computer Vision. Document Understanding (DU)
was not left behind, with the first Transformer-based models for DU dating from late
2019. However, the computational complexity of the self-attention operation
limits their capabilities to small sequences. In this paper we explore multiple
strategies for applying Transformer-based models to long multi-page documents. We
introduce two new multi-modal (text + layout) long-range models for DU. They are
based on efficient implementations of Transformers for long sequences.
Long-range models can effectively process whole documents at once and are less
impaired by a document's length. We compare them to LayoutLM, a classical
Transformer adapted for DU and pre-trained on millions of documents. We further
propose a 2D relative attention bias to guide self-attention towards relevant
tokens without harming model efficiency. We observe improvements in Information
Retrieval on multi-page business documents, at a small performance cost on
smaller sequences. Relative 2D attention proved effective on dense text
for both normal and long-range models.
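The 2D relative attention bias can be pictured as follows: the pairwise horizontal and vertical offsets between token bounding boxes are bucketed and mapped to learned per-head scalars that are added to the attention logits before the softmax. The sketch below is a minimal illustration under assumed choices; the `RelativeBias2D` module name, T5-style log bucketing, bucket count, and coordinate convention are assumptions, not the authors' implementation.

```python
# Hedged sketch of a 2D relative attention bias (illustrative only).
# Each token has a 2D position (x, y) from its bounding box; pairwise
# horizontal and vertical offsets are bucketed and mapped to a learned
# scalar per attention head, which is added to the attention logits.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelativeBias2D(nn.Module):  # hypothetical module name
    def __init__(self, num_heads: int, num_buckets: int = 32, max_distance: int = 1000):
        super().__init__()
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        # Separate learned tables for horizontal and vertical offsets.
        self.x_bias = nn.Embedding(2 * num_buckets + 1, num_heads)
        self.y_bias = nn.Embedding(2 * num_buckets + 1, num_heads)

    def _bucket(self, delta: torch.Tensor) -> torch.Tensor:
        # Signed, log-scale bucketing of relative distances (T5-style),
        # clamped so every index falls inside the embedding table.
        sign = torch.sign(delta)
        mag = torch.clamp(delta.abs().float(), min=1.0)
        scaled = torch.log(mag) / torch.log(torch.tensor(float(self.max_distance))) * self.num_buckets
        bucket = (sign * torch.clamp(scaled, max=self.num_buckets - 1).round()).long()
        return bucket + self.num_buckets  # shift to non-negative indices

    def forward(self, xy: torch.Tensor) -> torch.Tensor:
        # xy: (seq_len, 2) token coordinates; returns (num_heads, seq_len, seq_len).
        dx = xy[:, 0].unsqueeze(0) - xy[:, 0].unsqueeze(1)
        dy = xy[:, 1].unsqueeze(0) - xy[:, 1].unsqueeze(1)
        bias = self.x_bias(self._bucket(dx)) + self.y_bias(self._bucket(dy))
        return bias.permute(2, 0, 1)


def attention_with_2d_bias(q, k, v, bias):
    # q, k, v: (num_heads, seq_len, head_dim); bias: (num_heads, seq_len, seq_len).
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores + bias, dim=-1) @ v


# Example on random data (coordinates in an assumed 1000 x 1000 page grid).
L, H, D = 16, 4, 32
xy = torch.randint(0, 1000, (L, 2))
bias = RelativeBias2D(num_heads=H)(xy)
q = k = v = torch.randn(H, L, D)
out = attention_with_2d_bias(q, k, v, bias)  # (H, L, D)
```

Because the bias is a table lookup plus an element-wise addition to the logits, it leaves the asymptotic cost of attention unchanged, which is consistent with the abstract's claim that the bias guides attention without harming model efficiency.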
Related papers
- Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z)
- LOCOST: State-Space Models for Long Document Abstractive Summarization [76.31514220737272]
We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs.
With a computational complexity of $O(L \log L)$, this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns.
arXiv Detail & Related papers (2024-01-31T15:33:37Z)
- Transformer-based Models for Long-Form Document Matching: Challenges and Empirical Analysis [12.269318291685753]
We show that simple neural models outperform the more complex BERT-based models.
Simple models are also more robust to variations in document length and text perturbations.
arXiv Detail & Related papers (2023-02-07T21:51:05Z)
- An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification [37.069127262896764]
Non-hierarchical sparse attention Transformer-based models, such as Longformer and Big Bird, are popular approaches to working with long documents.
We develop and release fully pre-trained HAT models that use segment-wise followed by cross-segment encoders (see the illustrative sketch after this list).
Our best HAT model outperforms equally-sized Longformer models while using 10-20% less GPU memory and processing documents 40-45% faster.
arXiv Detail & Related papers (2022-10-11T15:17:56Z)
- HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in ROUGE F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z)
- Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
- Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling [51.79399904527525]
We propose a hierarchical interactive Transformer (Hi-Transformer) for efficient and effective long document modeling.
Hi-Transformer models documents in a hierarchical way: it first learns sentence representations and then learns document representations.
Experiments on three benchmark datasets validate the efficiency and effectiveness of Hi-Transformer in long document modeling.
arXiv Detail & Related papers (2021-06-02T09:30:29Z)
- Long-Span Dependencies in Transformer-based Summarization Systems [38.672160430296536]
Transformer-based models have achieved state-of-the-art results in a wide range of natural language processing (NLP) tasks including document summarization.
One issue with these transformer-based models is that they do not scale well in terms of memory and compute requirements as the input length grows.
In this work, we exploit large pre-trained transformer-based models and address long-span dependencies in abstractive summarization.
arXiv Detail & Related papers (2021-05-08T23:53:03Z)
- Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long Range Arena benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z)
- Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching [28.190001111358438]
We propose SMITH, a Siamese Multi-depth Transformer-based Hierarchical Encoder, for long-form document matching.
Our model contains several innovations to adapt self-attention models for longer text input.
We will open source a Wikipedia based benchmark dataset, code and a pre-trained checkpoint to accelerate future research on long-form document matching.
arXiv Detail & Related papers (2020-04-26T07:04:08Z)
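Several of the entries above (HAT with its segment-wise followed by cross-segment encoders, Hi-Transformer with its sentence-then-document representations) share the same hierarchical pattern. The sketch below is a minimal, hedged illustration of that pattern; the `HierarchicalEncoder` class, layer counts, segment length, and mean pooling are assumptions for illustration, not code from any of the papers.

```python
# Hedged sketch of the segment-wise / cross-segment pattern used by
# hierarchical long-document models (illustrative only; sizes, pooling,
# and names are assumptions, not the papers' implementations).
import torch
import torch.nn as nn


class HierarchicalEncoder(nn.Module):  # hypothetical class name
    def __init__(self, vocab_size=30522, d_model=256, nhead=4, seg_len=128):
        super().__init__()
        self.seg_len = seg_len
        self.embed = nn.Embedding(vocab_size, d_model)
        # Segment-wise encoder: full self-attention, but only within a segment,
        # so cost grows with the number of segments rather than quadratically
        # with the whole document length.
        self.segment_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2)
        # Cross-segment encoder: attends over one pooled vector per segment.
        self.cross_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len), with seq_len a multiple of seg_len (pad otherwise).
        b, n = token_ids.shape
        segs = token_ids.view(b * n // self.seg_len, self.seg_len)
        seg_states = self.segment_encoder(self.embed(segs))   # (b*S, seg_len, d)
        seg_vecs = seg_states.mean(dim=1)                      # mean-pool each segment
        seg_vecs = seg_vecs.view(b, n // self.seg_len, -1)     # (b, S, d)
        return self.cross_encoder(seg_vecs)                    # document-level states


# Example: a 4,096-token document becomes 32 segments of 128 tokens each.
doc = torch.randint(0, 30522, (1, 4096))
print(HierarchicalEncoder()(doc).shape)  # torch.Size([1, 32, 256])
```

The point of the two-stage design is that self-attention runs over seg_len-sized windows and then over one vector per segment, so the quadratic term applies to the segment length rather than to the full document length, which is what makes this family of models cheaper on long documents.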