Attention over pre-trained Sentence Embeddings for Long Document
Classification
- URL: http://arxiv.org/abs/2307.09084v1
- Date: Tue, 18 Jul 2023 09:06:35 GMT
- Title: Attention over pre-trained Sentence Embeddings for Long Document
Classification
- Authors: Amine Abdaoui and Sourav Dutta
- Abstract summary: Transformers are often limited to short sequences due to their quadratic attention complexity in the number of tokens.
We suggest taking advantage of pre-trained sentence transformers to start from semantically meaningful embeddings of the individual sentences.
We report the results obtained by this simple architecture on three standard document classification datasets.
- Score: 4.38566347001872
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite being the current de-facto models in most NLP tasks, transformers are
often limited to short sequences due to their quadratic attention complexity in
the number of tokens. Several attempts to address this issue have been studied,
either by reducing the cost of the self-attention computation or by modeling
smaller sequences and combining them through a recurrence mechanism or a new
transformer model. In this paper, we suggest taking advantage of pre-trained
sentence transformers to start from semantically meaningful embeddings of the
individual sentences, and then combining them through a small attention layer
that scales linearly with the document length. We report the results obtained
by this simple architecture on three standard document classification datasets.
When compared with the current state-of-the-art models using standard
fine-tuning, the studied method obtains competitive results (even if there is
no clear best model in this configuration). We also show that the studied
architecture obtains better results when freezing the underlying transformers,
a configuration that is useful when complete fine-tuning must be avoided (e.g.,
when the same frozen transformer is shared by different applications). Finally,
two additional experiments are provided to further evaluate the relevance of
the studied architecture over simpler baselines.
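
The architecture described in the abstract can be summarized concretely: a pre-trained sentence transformer (possibly frozen) maps each sentence to an embedding, and a small attention-pooling layer, whose cost is linear in the number of sentences, aggregates those embeddings into a single document vector fed to a classification head. Below is a minimal sketch of this idea in PyTorch, not the authors' implementation; the class name SentenceAttentionClassifier, the hidden size, and the example sentence-transformers model "all-MiniLM-L6-v2" are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): attention pooling over pre-computed
# sentence embeddings for long-document classification. Assumes `torch` and
# `sentence-transformers` are installed; names and sizes are illustrative.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer


class SentenceAttentionClassifier(nn.Module):
    """Learned-query attention over sentence embeddings, then a linear head.

    The pooling scales linearly with the number of sentences, unlike
    token-level self-attention, which is quadratic in the number of tokens.
    """

    def __init__(self, embed_dim: int, num_classes: int, hidden_dim: int = 128):
        super().__init__()
        self.scorer = nn.Sequential(            # small attention layer
            nn.Linear(embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, sent_embs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # sent_embs: (batch, max_sentences, embed_dim); mask: (batch, max_sentences)
        scores = self.scorer(sent_embs).squeeze(-1)            # (batch, max_sentences)
        scores = scores.masked_fill(~mask, float("-inf"))      # ignore padding sentences
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)  # (batch, max_sentences, 1)
        doc_emb = (weights * sent_embs).sum(dim=1)             # (batch, embed_dim)
        return self.classifier(doc_emb)


# Usage with a frozen sentence transformer: embeddings are pre-computed, so only
# the small attention layer and the classifier receive gradients.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example model, 384-dim embeddings
sentences = ["First sentence of the document.", "Second sentence.", "And a third one."]
with torch.no_grad():
    embs = torch.tensor(encoder.encode(sentences)).unsqueeze(0)  # (1, 3, 384)
mask = torch.ones(1, 3, dtype=torch.bool)

model = SentenceAttentionClassifier(embed_dim=embs.size(-1), num_classes=2)
logits = model(embs, mask)  # (1, 2)
```

With the encoder frozen, only the pooling layer and the linear head are trained, which matches the configuration the abstract reports as both effective and convenient when a single frozen transformer is shared across applications.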
Related papers
- Transformer-based Models for Long-Form Document Matching: Challenges and Empirical Analysis [12.269318291685753]
We show that simple neural models outperform the more complex BERT-based models.
Simple models are also more robust to variations in document length and text perturbations.
arXiv Detail & Related papers (2023-02-07T21:51:05Z)
- Mutual Exclusivity Training and Primitive Augmentation to Induce Compositionality [84.94877848357896]
Recent datasets expose the lack of systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z)
- Mixed-effects transformers for hierarchical adaptation [1.9105318290910576]
We introduce the mixed-effects transformer (MET), a novel approach for learning hierarchically-structured prefixes.
We show how the popular class of mixed-effects models may be extended to transformer-based architectures.
arXiv Detail & Related papers (2022-05-03T19:34:15Z)
- Paragraph-based Transformer Pre-training for Multi-Sentence Inference [99.59693674455582]
We show that popular pre-trained transformers perform poorly when used for fine-tuning on multi-candidate inference tasks.
We then propose a new pre-training objective that models the paragraph-level semantics across multiple input sentences.
arXiv Detail & Related papers (2022-05-02T21:41:14Z)
- Causal Transformer for Estimating Counterfactual Outcomes [18.640006398066188]
Estimating counterfactual outcomes over time from observational data is relevant for many applications.
We develop a novel Causal Transformer for estimating counterfactual outcomes over time.
Our model is specifically designed to capture complex, long-range dependencies among time-varying confounders.
arXiv Detail & Related papers (2022-04-14T22:40:09Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
- HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in ROUGE F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z)
- Long Range Arena: A Benchmark for Efficient Transformers [115.1654897514089]
The Long Range Arena benchmark is a suite of tasks consisting of sequences ranging from $1K$ to $16K$ tokens.
We systematically evaluate ten well-established long-range Transformer models on our newly proposed benchmark suite.
arXiv Detail & Related papers (2020-11-08T15:53:56Z)
- Long-Short Term Masking Transformer: A Simple but Effective Baseline for Document-level Neural Machine Translation [28.94748226472447]
We study the pros and cons of the standard transformer in document-level translation.
We propose a surprisingly simple long-short term masking self-attention on top of the standard transformer.
We can achieve a strong result in BLEU and capture discourse phenomena.
arXiv Detail & Related papers (2020-09-19T00:29:51Z)
- The Cascade Transformer: an Application for Efficient Answer Sentence Selection [116.09532365093659]
We introduce the Cascade Transformer, a technique to adapt transformer-based models into a cascade of rankers.
When compared to a state-of-the-art transformer model, our approach reduces computation by 37% with almost no impact on accuracy.
arXiv Detail & Related papers (2020-05-05T23:32:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.