On Sparsifying Encoder Outputs in Sequence-to-Sequence Models
- URL: http://arxiv.org/abs/2004.11854v1
- Date: Fri, 24 Apr 2020 16:57:52 GMT
- Title: On Sparsifying Encoder Outputs in Sequence-to-Sequence Models
- Authors: Biao Zhang, Ivan Titov, Rico Sennrich
- Abstract summary: We take Transformer as the testbed and introduce a layer of gates in-between the encoder and the decoder.
The gates are regularized using the expected value of the sparsity-inducing L0 penalty.
We investigate the effects of this sparsification on two machine translation and two summarization tasks.
- Score: 90.58793284654692
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequence-to-sequence models usually transfer all encoder outputs to the
decoder for generation. In this work, by contrast, we hypothesize that these
encoder outputs can be compressed to shorten the sequence delivered for
decoding. We take Transformer as the testbed and introduce a layer of
stochastic gates in-between the encoder and the decoder. The gates are
regularized using the expected value of the sparsity-inducing L0 penalty,
resulting in completely masking out a subset of encoder outputs. In other
words, via joint training, the L0DROP layer forces Transformer to route
information through a subset of its encoder states. We investigate the effects
of this sparsification on two machine translation and two summarization tasks.
Experiments show that, depending on the task, around 40-70% of source encodings
can be pruned without significantly compromising quality. The decrease of the
output length endows L0DROP with the potential of improving decoding
efficiency, where it yields a speedup of up to 1.65x on document summarization
tasks against the standard Transformer. We analyze the L0DROP behaviour and
observe that it exhibits systematic preferences for pruning certain word types,
e.g., function words and punctuation get pruned most. Inspired by these
observations, we explore the feasibility of specifying rule-based patterns that
mask out encoder outputs based on information such as part-of-speech tags, word
frequency and word position.
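To make the gating idea concrete, the following is a minimal sketch in Python/PyTorch of stochastic, input-dependent gates with an expected-L0 penalty. It is not the authors' released implementation: the class name `L0Gate`, the single linear layer that predicts gate logits, and the hard-concrete hyperparameters (beta, gamma, zeta) are assumptions for this example, following the standard hard-concrete relaxation of Louizos et al. (2018) on which expected-L0 penalties are typically based.

```python
# A minimal sketch of an L0Drop-style gating layer (not the authors' code).
# Assumptions: gate logits come from a single linear projection of each
# encoder state, and the expected-L0 penalty follows the hard-concrete
# relaxation of Louizos et al. (2018).
import math
import torch
import torch.nn as nn


class L0Gate(nn.Module):
    """Stochastic hard-concrete gates over a batch of encoder outputs."""

    def __init__(self, d_model: int, beta: float = 2.0 / 3.0,
                 gamma: float = -0.1, zeta: float = 1.1):
        super().__init__()
        self.beta, self.gamma, self.zeta = beta, gamma, zeta
        # Hypothetical choice: per-position gate logits from a linear layer.
        self.logit_proj = nn.Linear(d_model, 1)

    def forward(self, enc_out: torch.Tensor):
        # enc_out: (batch, src_len, d_model)
        log_alpha = self.logit_proj(enc_out).squeeze(-1)  # (batch, src_len)

        if self.training:
            # Reparameterized sample from the hard-concrete distribution.
            u = torch.rand_like(log_alpha).clamp(1e-6, 1.0 - 1e-6)
            s = torch.sigmoid((u.log() - (1.0 - u).log() + log_alpha) / self.beta)
        else:
            # Deterministic gate at inference time.
            s = torch.sigmoid(log_alpha)

        # Stretch to (gamma, zeta), then clip to [0, 1] so gates can hit exactly 0 or 1.
        z = torch.clamp(s * (self.zeta - self.gamma) + self.gamma, 0.0, 1.0)

        # Expected L0 penalty: the probability that each gate is non-zero.
        l0_penalty = torch.sigmoid(
            log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()

        # Gated encoder outputs; positions with z == 0 are completely masked
        # and could be removed before the decoder's cross-attention.
        return enc_out * z.unsqueeze(-1), z, l0_penalty
```

In a full training setup, the returned penalty would be scaled by a sparsity weight and added to the translation or summarization loss; at inference time, positions whose gate is exactly zero can be dropped from the memory the decoder attends to, which shortens the sequence passed to cross-attention. The rule-based variant mentioned at the end of the abstract would replace the learned gates with a fixed binary mask derived from part-of-speech tags, word frequency, or word position.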
Related papers
- Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks [53.550782959908524]
We introduce a new configuration for encoder-decoder models that improves efficiency on structured output and decomposable tasks.
Our method, prompt-in-decoder (PiD), encodes the input once and decodes the output in parallel, boosting both training and inference efficiency.
arXiv Detail & Related papers (2024-03-19T19:27:23Z)
- DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models [22.276574156358084]
We build a multi-exit encoder-decoder transformer model which is trained with deep supervision so that each of its decoder layers is capable of generating plausible predictions.
We show our approach can reduce overall inference latency by 30%-60% with comparable or even higher accuracy compared to baselines.
arXiv Detail & Related papers (2023-11-15T01:01:02Z)
- Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder [75.03283861464365]
The seq2seq task aims at generating the target sequence based on the given input source sequence.
Traditionally, most seq2seq tasks are solved with an encoder that encodes the source sequence and a decoder that generates the target text.
Recently, a number of new approaches have emerged that apply decoder-only language models directly to the seq2seq task.
arXiv Detail & Related papers (2023-04-08T15:44:29Z)
- Efficient Encoders for Streaming Sequence Tagging [13.692806815196077]
A naive application of state-of-the-art bidirectional encoders to streaming sequence tagging would require encoding each token from scratch for every new token in an incremental streaming input (like transcribed speech).
The inability to reuse previous computation leads to a higher number of floating point operations (FLOPs) and more unnecessary label flips.
We present a Hybrid Encoder with Adaptive Restart (HEAR) that addresses these issues while maintaining the performance of bidirectional encoders over offline (or complete) inputs.
arXiv Detail & Related papers (2023-01-23T02:20:39Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by a relative 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- On the Sub-Layer Functionalities of Transformer Decoder [74.83087937309266]
We study how Transformer-based decoders leverage information from the source and target languages.
Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance.
arXiv Detail & Related papers (2020-10-06T11:50:54Z)
- A Generative Approach to Titling and Clustering Wikipedia Sections [12.154365109117025]
We evaluate transformer encoders with various decoders for information organization through a new task: generation of section headings for Wikipedia articles.
Our analysis shows that decoders containing attention mechanisms over the encoder output achieve high-scoring results by generating extractive text.
A decoder without attention better facilitates semantic encoding and can be used to generate section embeddings.
arXiv Detail & Related papers (2020-05-22T14:49:07Z)
- Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding [59.48857453699463]
In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder.
Recent work has proposed to use representations from different encoder layers for diversified levels of information.
We propose layer-wise multi-view decoding, where each decoder layer attends both to the representations from the last encoder layer, which serve as a global view, and to those from the other encoder layers, which supplement a stereoscopic view of the source sequences.
arXiv Detail & Related papers (2020-05-16T20:00:39Z)
- Pseudo-Bidirectional Decoding for Local Sequence Transduction [31.05704333618685]
We propose a simple but versatile approach named Pseudo-Bidirectional Decoding (PBD) for LST tasks.
The proposed PBD approach provides right-side context information for the decoder and models the inductive bias of LST tasks.
Experimental results on several benchmark datasets show that our approach consistently improves the performance of standard seq2seq models on LST tasks.
arXiv Detail & Related papers (2020-01-31T07:55:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.