Sparsity and Sentence Structure in Encoder-Decoder Attention of
Summarization Systems
- URL: http://arxiv.org/abs/2109.03888v1
- Date: Wed, 8 Sep 2021 19:32:42 GMT
- Title: Sparsity and Sentence Structure in Encoder-Decoder Attention of
Summarization Systems
- Authors: Potsawee Manakul, Mark J. F. Gales
- Abstract summary: Transformer models have achieved state-of-the-art results in a wide range of NLP tasks including summarization.
Previous work has focused on one important bottleneck, the quadratic self-attention mechanism in the encoder.
This work focuses on the transformer's encoder-decoder attention mechanism.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer models have achieved state-of-the-art results in a wide range of
NLP tasks including summarization. Training and inference using large
transformer models can be computationally expensive. Previous work has focused
on one important bottleneck, the quadratic self-attention mechanism in the
encoder. Modified encoder architectures such as LED or LoBART use local
attention patterns to address this problem for summarization. In contrast, this
work focuses on the transformer's encoder-decoder attention mechanism. The cost
of this attention becomes more significant in inference or training approaches
that require model-generated histories. First, we examine the complexity of the
encoder-decoder attention. We demonstrate empirically that there is a sparse
sentence structure in document summarization that can be exploited by
constraining the attention mechanism to a subset of input sentences, whilst
maintaining system performance. Second, we propose a modified architecture that
selects the subset of sentences to constrain the encoder-decoder attention.
Experiments are carried out on abstractive summarization tasks, including
CNN/DailyMail, XSum, Spotify Podcast, and arXiv.
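The constrained encoder-decoder attention described above can be illustrated with a short, self-contained sketch in PyTorch. This is not the authors' implementation: the sentence-level mask follows the idea stated in the abstract, while the top-k selector and the tensor names (`sentence_ids`, `keep_sentences`) are illustrative assumptions standing in for the paper's actual sentence-selection architecture.

```python
# Minimal sketch of cross-attention constrained to a subset of input sentences.
# Illustrative only; the top-k selector below is a stand-in for the paper's
# learned sentence-selection module.
import torch
import torch.nn.functional as F

def sentence_constrained_cross_attention(query, key, value, sentence_ids, keep_sentences):
    """
    query:          (batch, tgt_len, d_model)  decoder states
    key, value:     (batch, src_len, d_model)  encoder outputs
    sentence_ids:   (batch, src_len)           sentence index of each source token (long)
    keep_sentences: (batch, n_keep)            indices of the sentences kept per example
    """
    d_model = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / d_model ** 0.5  # (B, T, S)

    # Mask every source token whose sentence is not in the selected subset.
    keep_mask = (sentence_ids.unsqueeze(-1) == keep_sentences.unsqueeze(1)).any(-1)  # (B, S)
    scores = scores.masked_fill(~keep_mask.unsqueeze(1), float("-inf"))

    attn = F.softmax(scores, dim=-1)  # assumes at least one sentence is kept
    return torch.matmul(attn, value)

def select_sentences_topk(full_attention, sentence_ids, n_sentences, k):
    """
    Simple stand-in selector: aggregate unconstrained cross-attention mass per
    sentence and keep the k highest-scoring sentences.
    full_attention: (batch, tgt_len, src_len)
    """
    token_mass = full_attention.sum(dim=1)                               # (B, S)
    sent_mass = torch.zeros(token_mass.size(0), n_sentences,
                            device=token_mass.device, dtype=token_mass.dtype)
    sent_mass.scatter_add_(1, sentence_ids, token_mass)                  # per-sentence mass
    return sent_mass.topk(k, dim=-1).indices                             # (B, k)
```

In this sketch the mask alone does not save compute; in practice one would gather only the keys and values of the kept sentences before the attention so that the effective source length shrinks, which is where a cost reduction in encoder-decoder attention would come from.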
Related papers
- $ε$-VAE: Denoising as Visual Decoding [61.29255979767292]
In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space.
Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input.
We propose denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder.
We evaluate our approach by assessing both reconstruction (rFID) and generation quality.
arXiv Detail & Related papers (2024-10-05T08:27:53Z) - Efficient Sample-Specific Encoder Perturbations [37.84914870036184]
We show that a small proxy network can be used to find a sample-by-sample perturbation of the encoder output of a frozen foundation model.
Results show consistent performance improvements on machine translation and speech recognition, evaluated through COMET and WER respectively.
arXiv Detail & Related papers (2024-05-01T08:55:16Z) - Take an Irregular Route: Enhance the Decoder of Time-Series Forecasting
Transformer [9.281993269355544]
We propose FPPformer, which uses bottom-up and top-down architectures in the encoder and decoder to build a full and rational hierarchy.
Extensive experiments on six state-of-the-art benchmarks verify the promising performance of FPPformer.
arXiv Detail & Related papers (2023-12-10T06:50:56Z) - Decoder-Only or Encoder-Decoder? Interpreting Language Model as a
Regularized Encoder-Decoder [75.03283861464365]
The seq2seq task aims to generate a target sequence from a given source sequence.
Traditionally, seq2seq tasks are addressed with an encoder that encodes the source sequence and a decoder that generates the target text.
Recently, a number of approaches have emerged that apply decoder-only language models directly to the seq2seq task.
arXiv Detail & Related papers (2023-04-08T15:44:29Z) - ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking
Inference [70.36083572306839]
This paper proposes a new training and inference paradigm for re-ranking.
We finetune a pretrained encoder-decoder model on document-to-query generation.
We show that this encoder-decoder architecture can be decomposed into a decoder-only language model during inference.
arXiv Detail & Related papers (2022-04-25T06:26:29Z) - Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training a sequence encoder.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z) - Hierarchical Attention Transformer Architecture For Syntactic Spell
Correction [1.0312968200748118]
We propose a multi-encoder, single-decoder variation of the conventional transformer.
We report significant improvements of 0.11%, 0.32%, and 0.69% in character (CER), word (WER), and sentence (SER) error rates respectively.
Our architecture also trains 7.8 times faster and is only about 1/3 the size of the next most accurate model.
arXiv Detail & Related papers (2020-05-11T06:19:01Z) - Fixed Encoder Self-Attention Patterns in Transformer-Based Machine
Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
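As a concrete illustration of the last entry, the sketch below hard-codes one example of a fixed, non-learnable attention pattern: a head that always attends to the previous token. This is a generic example written in plain PyTorch, not code from that paper, and the function name is chosen purely for exposition.

```python
# Sketch of a fixed (non-learnable) encoder attention head that always attends
# to the token immediately to the left. Illustrative only.
import torch

def previous_token_attention(values):
    """
    values: (batch, src_len, d_head) value vectors for one head.
    Each position receives the value of the token to its left;
    position 0 attends to itself.
    """
    src_len = values.size(1)
    # Fixed one-hot attention matrix with ones on the subdiagonal.
    attn = torch.diag(torch.ones(src_len - 1, device=values.device, dtype=values.dtype), -1)
    attn[0, 0] = 1.0  # first token has no left neighbour
    return torch.einsum("qk,bkd->bqd", attn, values)
```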