Discovering Non-monotonic Autoregressive Orderings with Variational
Inference
- URL: http://arxiv.org/abs/2110.15797v1
- Date: Wed, 27 Oct 2021 16:08:09 GMT
- Title: Discovering Non-monotonic Autoregressive Orderings with Variational
Inference
- Authors: Xuanlin Li, Brandon Trabucco, Dong Huk Park, Michael Luo, Sheng Shen,
Trevor Darrell, Yang Gao
- Abstract summary: We develop an unsupervised parallelizable learner that discovers high-quality generation orders purely from training data.
We implement the encoder as a Transformer with non-causal attention that outputs permutations in one forward pass.
Empirical results in language modeling tasks demonstrate that our method is context-aware and discovers orderings that are competitive with or even better than fixed orders.
- Score: 67.27561153666211
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The predominant approach for language modeling is to process sequences from
left to right, but this eliminates a source of information: the order by which
the sequence was generated. One strategy to recover this information is to
decode both the content and ordering of tokens. Existing approaches supervise
content and ordering by designing problem-specific loss functions and
pre-training with an ordering pre-selected. Other recent works use iterative
search to discover problem-specific orderings for training, but suffer from
high time complexity and cannot be efficiently parallelized. We address these
limitations with an unsupervised parallelizable learner that discovers
high-quality generation orders purely from training data -- no domain knowledge
required. The learner contains an encoder network and decoder language model
that perform variational inference with autoregressive orders (represented as
permutation matrices) as latent variables. The corresponding ELBO is not
differentiable, so we develop a practical algorithm for end-to-end optimization
using policy gradients. We implement the encoder as a Transformer with
non-causal attention that outputs permutations in one forward pass.
Permutations then serve as target generation orders for training an
insertion-based Transformer language model. Empirical results in language
modeling tasks demonstrate that our method is context-aware and discovers
orderings that are competitive with or even better than fixed orders.
Related papers
- Algorithmic Capabilities of Random Transformers [49.73113518329544]
We investigate what functions can be learned by randomly transformers in which only the embedding layers are optimized.
We find that these random transformers can perform a wide range of meaningful algorithmic tasks.
Our results indicate that some algorithmic capabilities are present in transformers even before these models are trained.
arXiv Detail & Related papers (2024-10-06T06:04:23Z) - Set-Based Prompting: Provably Solving the Language Model Order Dependency Problem [18.020492646988746]
We present Set-Based Prompting, a technique that guarantees the output of an LLM will not have order dependence on a specified set of sub-sequences.
Despite our inputs being out of distribution, the impact on expected accuracy is small, where the expectation is over the order of uniformly chosen shuffling of the candidate responses.
arXiv Detail & Related papers (2024-06-04T16:09:13Z) - GEC-DePenD: Non-Autoregressive Grammatical Error Correction with
Decoupled Permutation and Decoding [52.14832976759585]
Grammatical error correction (GEC) is an important NLP task that is usually solved with autoregressive sequence-to-sequence models.
We propose a novel non-autoregressive approach to GEC that decouples the architecture into a permutation network.
We show that the resulting network improves over previously known non-autoregressive methods for GEC.
arXiv Detail & Related papers (2023-11-14T14:24:36Z) - Uncovering mesa-optimization algorithms in Transformers [61.06055590704677]
Some autoregressive models can learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so.
We show that standard next-token prediction error minimization gives rise to a subsidiary learning algorithm that adjusts the model as new inputs are revealed.
Our findings explain in-context learning as a product of autoregressive loss minimization and inform the design of new optimization-based Transformer layers.
arXiv Detail & Related papers (2023-09-11T22:42:50Z) - Instruction Position Matters in Sequence Generation with Large Language
Models [67.87516654892343]
Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization.
We propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions after the input sentences.
arXiv Detail & Related papers (2023-08-23T12:36:57Z) - Learning and Analyzing Generation Order for Undirected Sequence Models [86.10875837475783]
We train a policy that learns the generation order for a pre-trained, undirected translation model via reinforcement learning.
We show that the translations by our learned orders achieve higher BLEU scores than the outputs decoded from left to right or decoded by the learned order from Mansimov et al.
Our findings could provide more insights on the mechanism of undirected generation models and encourage further research in this direction.
arXiv Detail & Related papers (2021-12-16T18:29:07Z) - SparseGAN: Sparse Generative Adversarial Network for Text Generation [8.634962333084724]
We propose a SparseGAN that generates semantic-interpretable, but sparse sentence representations as inputs to the discriminator.
With such semantic-rich representations, we not only reduce unnecessary noises for efficient adversarial training, but also make the entire training process fully differentiable.
arXiv Detail & Related papers (2021-03-22T04:44:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.