Attentive Multi-Layer Perceptron for Non-autoregressive Generation
- URL: http://arxiv.org/abs/2310.09512v1
- Date: Sat, 14 Oct 2023 06:44:24 GMT
- Title: Attentive Multi-Layer Perceptron for Non-autoregressive Generation
- Authors: Shuyang Jiang and Jun Zhang and Jiangtao Feng and Lin Zheng and
Lingpeng Kong
- Abstract summary: Non-autoregressive (NAR) generation is gaining popularity for its efficiency and growing efficacy.
In this paper, we propose a novel MLP variant, Attentive Multi-Layer Perceptron (AMLP), to produce a generation model with linear time and space complexity.
- Score: 46.14195464583495
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Autoregressive~(AR) generation has largely dominated sequence generation
owing to its efficacy. Recently, non-autoregressive~(NAR) generation has gained
increasing popularity for its efficiency and growing efficacy. However, its
efficiency is still bottlenecked by the quadratic complexity in sequence length,
which is prohibitive for scaling to long-sequence generation, and little work
has been done to mitigate this problem. In this paper, we propose a novel MLP
variant, \textbf{A}ttentive \textbf{M}ulti-\textbf{L}ayer \textbf{P}erceptron~(AMLP), to
produce a generation model with linear time and space complexity. Unlike the
classic MLP with static, learnable projection matrices, AMLP leverages adaptive
projections computed from the inputs in an attentive mode. The sample-aware
adaptive projections enable communication among tokens in a sequence and model
the similarity between the query and key spaces. Furthermore, we marry AMLP with
popular NAR models, deriving a highly efficient NAR-AMLP architecture with
linear time and space complexity. Empirical results show that this combined
architecture surpasses competitive efficient NAR models by a significant margin
on text-to-speech synthesis and machine translation. We also evaluate AMLP's
self- and cross-attention abilities separately with extensive ablation
experiments, and find them comparable or even superior to those of other
efficient models. The efficiency analysis further shows that AMLP drastically
reduces the memory cost compared with vanilla non-autoregressive models on long
sequences.
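The abstract describes AMLP only at a high level. As a rough illustration of the general idea, replacing an MLP's static projection matrices with sample-aware projections aggregated from the input tokens so that time and memory stay linear in sequence length, here is a minimal sketch; the module name, dimensions, softmax normalisation over tokens, and ReLU are assumptions for illustration, not the paper's actual design.

```python
from typing import Optional

import torch
import torch.nn as nn


class AttentiveMLP(nn.Module):
    """Illustrative sketch: an MLP block whose hidden projection is computed
    from the inputs (sample-aware) instead of being a static parameter.
    Cost is O(n * d^2) in the sequence length n, i.e. linear in n."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.to_query = nn.Linear(d_model, d_hidden)
        self.to_key = nn.Linear(d_model, d_hidden)
        self.to_value = nn.Linear(d_model, d_hidden)
        self.out = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor, context: Optional[torch.Tensor] = None) -> torch.Tensor:
        # x:       (batch, n_query, d_model)  queries
        # context: (batch, n_key,   d_model)  keys/values; x itself for self-attention
        if context is None:
            context = x
        k = torch.softmax(self.to_key(context), dim=1)  # normalise over tokens
        v = self.to_value(context)
        # Adaptive projection, one matrix per sample: (batch, d_hidden, d_hidden),
        # built by summing over tokens, so memory never grows quadratically in n.
        proj = torch.einsum("bnh,bnm->bhm", k, v)
        q = self.to_query(x)
        # Replace the MLP's static weight matrix with the adaptive projection.
        h = torch.relu(torch.einsum("bnh,bhm->bnm", q, proj))
        return self.out(h)


# Example: a batch of 2 sequences of length 1024 with 256-dimensional tokens.
block = AttentiveMLP(d_model=256, d_hidden=64)
y = block(torch.randn(2, 1024, 256))
print(y.shape)  # torch.Size([2, 1024, 256])
```

Because the adaptive projection is only a d_hidden-by-d_hidden matrix accumulated over tokens, nothing in the block scales quadratically with the sequence length, which is the property the abstract emphasises for long sequences.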
Related papers
- Universality of Linear Recurrences Followed by Non-linear Projections: Finite-Width Guarantees and Benefits of Complex Eigenvalues [32.783917920167205]
We show that combining non-linear projections with either real or complex linear diagonal recurrences leads to arbitrarily precise approximation of sequence-to-sequence maps.
We prove that employing complex eigenvalues near the unit disk, i.e., the most successful strategy in S4, greatly helps the RNN store information.
arXiv Detail & Related papers (2023-07-21T20:09:06Z)
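As a concrete illustration of the pattern that paper analyses, a diagonal linear recurrence with complex eigenvalues near the unit disk followed by a position-wise non-linear projection, the sketch below runs such a recurrence in NumPy; the dimensions, eigenvalue range, and tanh projection are illustrative assumptions rather than details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_in, d_state = 64, 4, 16

# Complex eigenvalues with modulus just below 1: slow decay lets the
# recurrence retain information over long spans (the S4-style regime).
modulus = rng.uniform(0.95, 0.999, d_state)
phase = rng.uniform(0.0, 2.0 * np.pi, d_state)
lam = modulus * np.exp(1j * phase)                     # (d_state,) diagonal eigenvalues

B = rng.normal(size=(d_state, d_in)).astype(complex)   # input projection
u = rng.normal(size=(seq_len, d_in))                   # input sequence

# Diagonal linear recurrence: x_t = lam * x_{t-1} + B u_t (element-wise in lam).
x = np.zeros(d_state, dtype=complex)
states = []
for t in range(seq_len):
    x = lam * x + B @ u[t]
    states.append(x)
H = np.stack(states)                                   # (seq_len, d_state), complex

# Non-linear projection applied position-wise to the real/imaginary parts.
W = rng.normal(size=(2 * d_state, d_in))
y = np.tanh(np.concatenate([H.real, H.imag], axis=-1) @ W)  # (seq_len, d_in)
```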
- SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking [60.109453252858806]
A maximum-likelihood (MLE) objective does not match a downstream use-case of autoregressively generating high-quality sequences.
We formulate sequence generation as an imitation learning (IL) problem.
This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and sequences from a dataset.
Our resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes.
arXiv Detail & Related papers (2023-06-08T17:59:58Z)
- ELMER: A Non-Autoregressive Pre-trained Language Model for Efficient and Effective Text Generation [97.64625999380425]
We study the text generation task under the approach of pre-trained language models (PLMs).
By leveraging the early exit technique, ELMER enables token generation at different layers, according to their prediction confidence.
Experiments on three text generation tasks show that ELMER significantly outperforms NAR models.
arXiv Detail & Related papers (2022-10-24T14:46:47Z)
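ELMER's exact mechanism is not spelled out here, so the following is only a hedged sketch of confidence-based early exit, in which each position commits its token at the first layer whose prediction clears a confidence threshold; the function name, the 0.9 threshold, and the stand-in layers and output head are assumptions for illustration.

```python
import torch
import torch.nn as nn


def early_exit_decode(hidden, layers, lm_head, threshold=0.9):
    """Sketch of confidence-based early exit for parallel (NAR) decoding.
    hidden: (batch, seq_len, d_model); layers: iterable of Transformer layers."""
    batch, seq_len, _ = hidden.shape
    tokens = torch.zeros(batch, seq_len, dtype=torch.long)
    done = torch.zeros(batch, seq_len, dtype=torch.bool)
    for layer in layers:
        hidden = layer(hidden)
        probs = torch.softmax(lm_head(hidden), dim=-1)
        conf, pred = probs.max(dim=-1)
        exit_now = (conf >= threshold) & ~done  # confident and not yet emitted
        tokens[exit_now] = pred[exit_now]       # commit these positions at this layer
        done |= exit_now
        if bool(done.all()):
            break
    tokens[~done] = pred[~done]                 # remaining positions use the last layer
    return tokens


# Example with stand-in components (vocabulary of 1000, 4 layers).
d_model, vocab = 128, 1000
layers = [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True) for _ in range(4)]
lm_head = nn.Linear(d_model, vocab)
out = early_exit_decode(torch.randn(2, 16, d_model), layers, lm_head)
print(out.shape)  # torch.Size([2, 16])
```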
- Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
- MOI-Mixer: Improving MLP-Mixer with Multi Order Interactions in Sequential Recommendation [40.20599070308035]
Transformer-based models have quadratic memory and time complexity in the sequence length, making it difficult to extract the long-term interests of users.
MLP-based models, renowned for their linear memory and time complexity, have recently shown competitive results compared to Transformer in various tasks.
We propose the Multi-Order Interaction layer, which can express an arbitrary order of interactions while maintaining the memory and time complexity of an MLP layer.
arXiv Detail & Related papers (2021-08-17T08:38:49Z)
- Bayesian Inference in High-Dimensional Time-Series with the Orthogonal Stochastic Linear Mixing Model [2.7909426811685893]
Many modern time-series datasets contain large numbers of output response variables sampled for prolonged periods of time.
In this paper, we propose a new Markov chain Monte Carlo framework for the analysis of diverse, large-scale time-series datasets.
arXiv Detail & Related papers (2021-06-25T01:12:54Z)
- TSNAT: Two-Step Non-Autoregressive Transformer Models for Speech Recognition [69.68154370877615]
Non-autoregressive (NAR) models remove the temporal dependency between output tokens and can predict all output tokens in as few as one step.
To address these problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT).
The results show that TSNAT achieves performance competitive with the AR model and outperforms many complicated NAR models.
arXiv Detail & Related papers (2021-04-04T02:34:55Z)
- An EM Approach to Non-autoregressive Conditional Sequence Generation [49.11858479436565]
Autoregressive (AR) models have been the dominating approach to conditional sequence generation.
Non-autoregressive (NAR) models have been recently proposed to reduce the latency by generating all output tokens in parallel.
This paper proposes a new approach that jointly optimizes both AR and NAR models in a unified Expectation-Maximization framework.
arXiv Detail & Related papers (2020-06-29T20:58:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.