Attention over pre-trained Sentence Embeddings for Long Document
Classification
- URL: http://arxiv.org/abs/2307.09084v1
- Date: Tue, 18 Jul 2023 09:06:35 GMT
- Title: Attention over pre-trained Sentence Embeddings for Long Document
Classification
- Authors: Amine Abdaoui and Sourav Dutta
- Abstract summary: Transformers are often limited to short sequences due to their quadratic attention complexity in the number of tokens.
We suggest taking advantage of pre-trained sentence transformers to start from semantically meaningful embeddings of the individual sentences.
We report the results obtained by this simple architecture on three standard document classification datasets.
- Score: 4.38566347001872
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite being the current de-facto models in most NLP tasks, transformers are
often limited to short sequences due to their quadratic attention complexity in
the number of tokens. Several attempts to address this issue have been studied,
either by reducing the cost of the self-attention computation or by modeling
smaller sequences and combining them through a recurrence mechanism or a new
transformer model. In this paper, we suggest taking advantage of pre-trained
sentence transformers to start from semantically meaningful embeddings of the
individual sentences, and then combining them through a small attention layer
that scales linearly with the document length. We report the results obtained
by this simple architecture on three standard document classification datasets.
When compared with the current state-of-the-art models using standard
fine-tuning, the studied method obtains competitive results (even if there is
no clear best model in this configuration). We also show that the studied
architecture obtains better results when freezing the underlying transformers,
a configuration that is useful when complete fine-tuning must be avoided (e.g.,
when the same frozen transformer is shared by different applications). Finally,
two additional experiments are provided to further evaluate the relevance of
the studied architecture over simpler baselines.
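
The architecture described in the abstract can be summarized concretely: a pre-trained sentence transformer (possibly frozen) maps each sentence to an embedding, and a small attention-pooling layer, whose cost is linear in the number of sentences, aggregates those embeddings into a single document vector fed to a classification head. Below is a minimal sketch of this idea in PyTorch, not the authors' implementation; the class name SentenceAttentionClassifier, the hidden size, and the example sentence-transformers model "all-MiniLM-L6-v2" are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): attention pooling over pre-computed
# sentence embeddings for long-document classification. Assumes `torch` and
# `sentence-transformers` are installed; names and sizes are illustrative.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer


class SentenceAttentionClassifier(nn.Module):
    """Learned-query attention over sentence embeddings, then a linear head.

    The pooling scales linearly with the number of sentences, unlike
    token-level self-attention, which is quadratic in the number of tokens.
    """

    def __init__(self, embed_dim: int, num_classes: int, hidden_dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(            # small attention layer
            nn.Linear(embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, sent_embs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # sent_embs: (batch, max_sentences, embed_dim); mask: (batch, max_sentences)
        scores = self.scorer(sent_embs).squeeze(-1)            # (batch, max_sentences)
        scores = scores.masked_fill(~mask, float("-inf"))      # ignore padding sentences
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)  # (batch, max_sentences, 1)
        doc_emb = (weights * sent_embs).sum(dim=1)             # (batch, embed_dim)
        return self.classifier(doc_emb)


# Usage with a frozen sentence transformer: embeddings are pre-computed, so only
# the small attention layer and the classifier receive gradients.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example model, 384-dim embeddings
sentences = ["First sentence of the document.", "Second sentence.", "And a third one."]
with torch.no_grad():
    embs = torch.tensor(encoder.encode(sentences)).unsqueeze(0)  # (1, 3, 384)
mask = torch.ones(1, 3, dtype=torch.bool)

model = SentenceAttentionClassifier(embed_dim=embs.size(-1), num_classes=2)
logits = model(embs, mask)  # (1, 2)
```

With the encoder frozen, only the pooling layer and the linear head are trained, which matches the configuration the abstract reports as both effective and convenient when a single frozen transformer is shared across applications.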
Related papers
- Transformer-based Models for Long-Form Document Matching: Challenges and Empirical Analysis [12.269318291685753]
We show that simple neural models outperform the more complex BERT-based models.
Simple models are also more robust to variations in document length and text perturbations.
arXiv Detail & Related papers (2023-02-07T21:51:05Z)
- Mutual Exclusivity Training and Primitive Augmentation to Induce Compositionality [84.94877848357896]
Recent datasets expose the lack of systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z)
- Mixed-effects transformers for hierarchical adaptation [1.9105318290910576]
We introduce the mixed-effects transformer (MET), a novel approach for learning hierarchically-structured prefixes.
We show how the popular class of mixed-effects models may be extended to transformer-based architectures.
arXiv Detail & Related papers (2022-05-03T19:34:15Z)
- Paragraph-based Transformer Pre-training for Multi-Sentence Inference [99.59693674455582]
We show that popular pre-trained transformers perform poorly when used for fine-tuning on multi-candidate inference tasks.
We then propose a new pre-training objective that models the paragraph-level semantics across multiple input sentences.
arXiv Detail & Related papers (2022-05-02T21:41:14Z)
- Causal Transformer for Estimating Counterfactual Outcomes [18.640006398066188]
Estimating counterfactual outcomes over time from observational data is relevant for many applications.
We develop a novel Causal Transformer for estimating counterfactual outcomes over time.
Our model is specifically designed to capture complex, long-range dependencies among time-varying confounders.
arXiv Detail & Related papers (2022-04-14T22:40:09Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
- HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in ROUGE F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z)
- Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long Range Arena benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z)
- Long-Short Term Masking Transformer: A Simple but Effective Baseline for Document-level Neural Machine Translation [28.94748226472447]
We study the pros and cons of the standard transformer in document-level translation.
We propose a surprisingly simple long-short term masking self-attention on top of the standard transformer.
We can achieve a strong result in BLEU and capture discourse phenomena.
arXiv Detail & Related papers (2020-09-19T00:29:51Z)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.