Sparsity and Sentence Structure in Encoder-Decoder Attention of
Summarization Systems
- URL: http://arxiv.org/abs/2109.03888v1
- Date: Wed, 8 Sep 2021 19:32:42 GMT
- Title: Sparsity and Sentence Structure in Encoder-Decoder Attention of
Summarization Systems
- Authors: Potsawee Manakul, Mark J. F. Gales
- Abstract summary: Transformer models have achieved state-of-the-art results in a wide range of NLP tasks including summarization.
Previous work has focused on one important bottleneck, the quadratic self-attention mechanism in the encoder.
This work focuses on the transformer's encoder-decoder attention mechanism.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer models have achieved state-of-the-art results in a wide range of
NLP tasks including summarization. Training and inference using large
transformer models can be computationally expensive. Previous work has focused
on one important bottleneck, the quadratic self-attention mechanism in the
encoder. Modified encoder architectures such as LED or LoBART use local
attention patterns to address this problem for summarization. In contrast, this
work focuses on the transformer's encoder-decoder attention mechanism. The cost
of this attention becomes more significant in inference or training approaches
that require model-generated histories. First, we examine the complexity of the
encoder-decoder attention. We demonstrate empirically that there is a sparse
sentence structure in document summarization that can be exploited by
constraining the attention mechanism to a subset of input sentences, whilst
maintaining system performance. Second, we propose a modified architecture that
selects the subset of sentences to constrain the encoder-decoder attention.
Experiments are carried out on abstractive summarization tasks, including
CNN/DailyMail, XSum, Spotify Podcast, and arXiv.
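The constrained encoder-decoder attention described above can be illustrated with a short, self-contained sketch in PyTorch. This is not the authors' implementation: the sentence-level mask follows the idea stated in the abstract, while the top-k selector and the tensor names (`sentence_ids`, `keep_sentences`) are illustrative assumptions standing in for the paper's actual sentence-selection architecture.

```python
# Minimal sketch of cross-attention constrained to a subset of input sentences.
# Illustrative only; the top-k selector below is a stand-in for the paper's
# learned sentence-selection module.
import torch
import torch.nn.functional as F

def sentence_constrained_cross_attention(query, key, value, sentence_ids, keep_sentences):
    """
    query:          (batch, tgt_len, d_model)  decoder states
    key, value:     (batch, src_len, d_model)  encoder outputs
    sentence_ids:   (batch, src_len)           sentence index of each source token (long)
    keep_sentences: (batch, n_keep)            indices of the sentences kept per example
    """
    d_model = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / d_model ** 0.5  # (B, T, S)

    # Mask every source token whose sentence is not in the selected subset.
    keep_mask = (sentence_ids.unsqueeze(-1) == keep_sentences.unsqueeze(1)).any(-1)  # (B, S)
    scores = scores.masked_fill(~keep_mask.unsqueeze(1), float("-inf"))

    attn = F.softmax(scores, dim=-1)  # assumes at least one sentence is kept
    return torch.matmul(attn, value)

def select_sentences_topk(full_attention, sentence_ids, n_sentences, k):
    """
    Simple stand-in selector: aggregate unconstrained cross-attention mass per
    sentence and keep the k highest-scoring sentences.
    full_attention: (batch, tgt_len, src_len)
    """
    token_mass = full_attention.sum(dim=1)                               # (B, S)
    sent_mass = torch.zeros(token_mass.size(0), n_sentences,
                            device=token_mass.device, dtype=token_mass.dtype)
    sent_mass.scatter_add_(1, sentence_ids, token_mass)                  # per-sentence mass
    return sent_mass.topk(k, dim=-1).indices                             # (B, k)
```

In this sketch the mask alone does not save compute; in practice one would gather only the keys and values of the kept sentences before the attention so that the effective source length shrinks, which is where a cost reduction in encoder-decoder attention would come from.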
Related papers
- $ε$-VAE: Denoising as Visual Decoding [61.29255979767292]
In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space.
Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input.
We propose denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder.
We evaluate our approach by assessing both reconstruction (rFID) and generation quality.
arXiv Detail & Related papers (2024-10-05T08:27:53Z) - Efficient Sample-Specific Encoder Perturbations [37.84914870036184]
We show that a small proxy network can be used to find a sample-by-sample perturbation of the encoder output of a frozen foundation model.
Results show consistent performance improvements on machine translation and speech recognition, evaluated through COMET and WER respectively.
arXiv Detail & Related papers (2024-05-01T08:55:16Z) - Take an Irregular Route: Enhance the Decoder of Time-Series Forecasting
Transformer [9.281993269355544]
We propose FPPformer, which uses bottom-up and top-down architectures in the encoder and decoder to build a full and rational hierarchy.
Extensive experiments on six state-of-the-art benchmarks verify the promising performance of FPPformer.
arXiv Detail & Related papers (2023-12-10T06:50:56Z) - Decoder-Only or Encoder-Decoder? Interpreting Language Model as a
Regularized Encoder-Decoder [75.03283861464365]
The seq2seq task aims to generate a target sequence from a given source sequence.
Traditionally, seq2seq tasks are addressed with an encoder that encodes the source sequence and a decoder that generates the target text.
Recently, a number of approaches have emerged that apply decoder-only language models directly to the seq2seq task.
arXiv Detail & Related papers (2023-04-08T15:44:29Z) - ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking
Inference [70.36083572306839]
This paper proposes a new training and inference paradigm for re-ranking.
We finetune a pretrained encoder-decoder model on document-to-query generation.
We show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference.
arXiv Detail & Related papers (2022-04-25T06:26:29Z) - Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training a sequence encoder.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z) - Hierarchical Attention Transformer Architecture For Syntactic Spell
Correction [1.0312968200748118]
We propose a multi-encoder, single-decoder variation of the conventional transformer.
We report significant improvements of 0.11%, 0.32%, and 0.69% in character (CER), word (WER), and sentence (SER) error rates respectively.
Our architecture also trains 7.8 times faster and is only about 1/3 the size of the next most accurate model.
arXiv Detail & Related papers (2020-05-11T06:19:01Z) - Fixed Encoder Self-Attention Patterns in Transformer-Based Machine
Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
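As a concrete illustration of the last entry, the sketch below hard-codes one example of a fixed, non-learnable attention pattern: a head that always attends to the previous token. This is a generic example written in plain PyTorch, not code from that paper, and the function name is chosen purely for exposition.

```python
# Sketch of a fixed (non-learnable) encoder attention head that always attends
# to the token immediately to the left. Illustrative only.
import torch

def previous_token_attention(values):
    """
    values: (batch, src_len, d_head) value vectors for one head.
    Each position receives the value of the token to its left;
    position 0 attends to itself.
    """
    src_len = values.size(1)
    # Fixed one-hot attention matrix with ones on the subdiagonal.
    attn = torch.diag(torch.ones(src_len - 1, device=values.device, dtype=values.dtype), -1)
    attn[0, 0] = 1.0  # first token has no left neighbour
    return torch.einsum("qk,bkd->bqd", attn, values)
```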