Long-Range Transformer Architectures for Document Understanding
- URL: http://arxiv.org/abs/2309.05503v1
- Date: Mon, 11 Sep 2023 14:45:24 GMT
- Title: Long-Range Transformer Architectures for Document Understanding
- Authors: Thibault Douzon, Stefan Duffner, Christophe Garcia and Jérémy Espinas
- Abstract summary: Document Understanding (DU) was not left behind, with the first Transformer-based models for DU dating from late 2019.
We introduce two new multi-modal (text + layout) long-range models for DU based on efficient implementations of Transformers for long sequences.
Relative 2D attention proved effective on dense text for both normal and long-range models.
- Score: 1.9331361036118608
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since their release, Transformers have revolutionized many fields, from
Natural Language Understanding to Computer Vision. Document Understanding (DU)
was not left behind, with the first Transformer-based models for DU dating from late
2019. However, the computational complexity of the self-attention operation
limits their capabilities to small sequences. In this paper we explore multiple
strategies for applying Transformer-based models to long multi-page documents. We
introduce two new multi-modal (text + layout) long-range models for DU. They are
based on efficient implementations of Transformers for long sequences.
Long-range models can effectively process whole documents at once and are less
impaired by a document's length. We compare them to LayoutLM, a classical
Transformer adapted for DU and pre-trained on millions of documents. We further
propose a 2D relative attention bias to guide self-attention towards relevant
tokens without harming model efficiency. We observe improvements in Information
Retrieval on multi-page business documents, at a small performance cost on
smaller sequences. Relative 2D attention proved effective on dense text
for both normal and long-range models.
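The 2D relative attention bias can be pictured as follows: the pairwise horizontal and vertical offsets between token bounding boxes are bucketed and mapped to learned per-head scalars that are added to the attention logits before the softmax. The sketch below is a minimal illustration under assumed choices; the `RelativeBias2D` module name, T5-style log bucketing, bucket count, and coordinate convention are assumptions, not the authors' implementation.

```python
# Hedged sketch of a 2D relative attention bias (illustrative only).
# Each token has a 2D position (x, y) from its bounding box; pairwise
# horizontal and vertical offsets are bucketed and mapped to a learned
# scalar per attention head, which is added to the attention logits.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelativeBias2D(nn.Module):  # hypothetical module name
    def __init__(self, num_heads: int, num_buckets: int = 32, max_distance: int = 1000):
        super().__init__()
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        # Separate learned tables for horizontal and vertical offsets.
        self.x_bias = nn.Embedding(2 * num_buckets + 1, num_heads)
        self.y_bias = nn.Embedding(2 * num_buckets + 1, num_heads)

    def _bucket(self, delta: torch.Tensor) -> torch.Tensor:
        # Signed, log-scale bucketing of relative distances (T5-style),
        # clamped so every index falls inside the embedding table.
        sign = torch.sign(delta)
        mag = torch.clamp(delta.abs().float(), min=1.0)
        scaled = torch.log(mag) / torch.log(torch.tensor(float(self.max_distance))) * self.num_buckets
        bucket = (sign * torch.clamp(scaled, max=self.num_buckets - 1).round()).long()
        return bucket + self.num_buckets  # shift to non-negative indices

    def forward(self, xy: torch.Tensor) -> torch.Tensor:
        # xy: (seq_len, 2) token coordinates; returns (num_heads, seq_len, seq_len).
        dx = xy[:, 0].unsqueeze(0) - xy[:, 0].unsqueeze(1)
        dy = xy[:, 1].unsqueeze(0) - xy[:, 1].unsqueeze(1)
        bias = self.x_bias(self._bucket(dx)) + self.y_bias(self._bucket(dy))
        return bias.permute(2, 0, 1)


def attention_with_2d_bias(q, k, v, bias):
    # q, k, v: (num_heads, seq_len, head_dim); bias: (num_heads, seq_len, seq_len).
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores + bias, dim=-1) @ v


# Example on random data (coordinates in an assumed 1000 x 1000 page grid).
L, H, D = 16, 4, 32
xy = torch.randint(0, 1000, (L, 2))
bias = RelativeBias2D(num_heads=H)(xy)
q = k = v = torch.randn(H, L, D)
out = attention_with_2d_bias(q, k, v, bias)  # (H, L, D)
```

Because the bias is a table lookup plus an element-wise addition to the logits, it leaves the asymptotic cost of attention unchanged, which is consistent with the abstract's claim that the bias guides attention without harming model efficiency.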
Related papers
- Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z)
- LOCOST: State-Space Models for Long Document Abstractive Summarization [76.31514220737272]
We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs.
With a computational complexity of $O(L \log L)$, this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns.
arXiv Detail & Related papers (2024-01-31T15:33:37Z)
- Transformer-based Models for Long-Form Document Matching: Challenges and Empirical Analysis [12.269318291685753]
We show that simple neural models outperform the more complex BERT-based models.
Simple models are also more robust to variations in document length and text perturbations.
arXiv Detail & Related papers (2023-02-07T21:51:05Z)
- An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification [37.069127262896764]
Non-hierarchical sparse attention Transformer-based models, such as Longformer and Big Bird, are popular approaches to working with long documents.
We develop and release fully pre-trained HAT models that use segment-wise followed by cross-segment encoders (see the illustrative sketch after this list).
Our best HAT model outperforms equally-sized Longformer models while using 10-20% less GPU memory and processing documents 40-45% faster.
arXiv Detail & Related papers (2022-10-11T15:17:56Z)
- HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in ROUGE F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z)
- Long-Short Transformer: Efficient Transformers for Language and Vision [97.2850205384295]
Long-Short Transformer (Transformer-LS) is an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks.
It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations.
Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification.
arXiv Detail & Related papers (2021-07-05T18:00:14Z)
- Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling [51.79399904527525]
We propose a hierarchical interactive Transformer (Hi-Transformer) for efficient and effective long document modeling.
Hi-Transformer models documents in a hierarchical way: it first learns sentence representations and then learns document representations.
Experiments on three benchmark datasets validate the efficiency and effectiveness of Hi-Transformer in long document modeling.
arXiv Detail & Related papers (2021-06-02T09:30:29Z)
- Long-Span Dependencies in Transformer-based Summarization Systems [38.672160430296536]
Transformer-based models have achieved state-of-the-art results in a wide range of natural language processing (NLP) tasks including document summarization.
One issue with these transformer-based models is that they do not scale well in terms of memory and compute requirements as the input length grows.
In this work, we exploit large pre-trained transformer-based models and address long-span dependencies in abstractive summarization.
arXiv Detail & Related papers (2021-05-08T23:53:03Z)
- Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long Range Arena benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z)
- Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching [28.190001111358438]
We propose SMITH, a Siamese Multi-depth Transformer-based Hierarchical Encoder, for long-form document matching.
Our model contains several innovations to adapt self-attention models for longer text input.
We will open source a Wikipedia based benchmark dataset, code and a pre-trained checkpoint to accelerate future research on long-form document matching.
arXiv Detail & Related papers (2020-04-26T07:04:08Z)
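Several of the entries above (HAT with its segment-wise followed by cross-segment encoders, Hi-Transformer with its sentence-then-document representations) share the same hierarchical pattern. The sketch below is a minimal, hedged illustration of that pattern; the `HierarchicalEncoder` class, layer counts, segment length, and mean pooling are assumptions for illustration, not code from any of the papers.

```python
# Hedged sketch of the segment-wise / cross-segment pattern used by
# hierarchical long-document models (illustrative only; sizes, pooling,
# and names are assumptions, not the papers' implementations).
import torch
import torch.nn as nn


class HierarchicalEncoder(nn.Module):  # hypothetical class name
    def __init__(self, vocab_size=30522, d_model=256, nhead=4, seg_len=128):
        super().__init__()
        self.seg_len = seg_len
        self.embed = nn.Embedding(vocab_size, d_model)
        # Segment-wise encoder: full self-attention, but only within a segment,
        # so cost grows with the number of segments rather than quadratically
        # with the whole document length.
        self.segment_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2)
        # Cross-segment encoder: attends over one pooled vector per segment.
        self.cross_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len), with seq_len a multiple of seg_len (pad otherwise).
        b, n = token_ids.shape
        segs = token_ids.view(b * n // self.seg_len, self.seg_len)
        seg_states = self.segment_encoder(self.embed(segs))   # (b*S, seg_len, d)
        seg_vecs = seg_states.mean(dim=1)                      # mean-pool each segment
        seg_vecs = seg_vecs.view(b, n // self.seg_len, -1)     # (b, S, d)
        return self.cross_encoder(seg_vecs)                    # document-level states


# Example: a 4,096-token document becomes 32 segments of 128 tokens each.
doc = torch.randint(0, 30522, (1, 4096))
print(HierarchicalEncoder()(doc).shape)  # torch.Size([1, 32, 256])
```

The point of the two-stage design is that self-attention runs over seg_len-sized windows and then over one vector per segment, so the quadratic term applies to the segment length rather than to the full document length, which is what makes this family of models cheaper on long documents.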