Learning to Achieve Goals with Belief State Transformers
- URL: http://arxiv.org/abs/2410.23506v1
- Date: Wed, 30 Oct 2024 23:26:06 GMT
- Title: Learning to Achieve Goals with Belief State Transformers
- Authors: Edward S. Hu, Kwangjun Ahn, Qinghua Liu, Haoran Xu, Manan Tomar, Ada Langford, Dinesh Jayaraman, Alex Lamb, John Langford
- Abstract summary: The "Belief State Transformer" is a next-token predictor that takes both a prefix and a suffix as inputs.
The Belief State Transformer effectively learns to solve challenging problems that conventional forward-only transformers struggle with.
- Score: 50.196123952714245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the "Belief State Transformer", a next-token predictor that takes both a prefix and suffix as inputs, with a novel objective of predicting both the next token for the prefix and the previous token for the suffix. The Belief State Transformer effectively learns to solve challenging problems that conventional forward-only transformers struggle with, in a domain-independent fashion. Key to this success is learning a compact belief state that captures all relevant information necessary for accurate predictions. Empirical ablations show that each component of the model is essential in difficult scenarios where standard Transformers fall short. For the task of story writing with known prefixes and suffixes, our approach outperforms the Fill-in-the-Middle method for reaching known goals and demonstrates improved performance even when the goals are unknown. Altogether, the Belief State Transformer enables more efficient goal-conditioned decoding, better test-time inference, and high-quality text representations on small scale problems.
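The objective described in the abstract can be illustrated with a minimal sketch of how training targets might be laid out. This is not the authors' implementation: the function name `make_examples`, the `Example` record, and the choice to enumerate every prefix/suffix pair separated by at least one token are assumptions made only for illustration. For each pair, the targets are the token that follows the prefix and the token that precedes the suffix.

```python
from typing import List, NamedTuple, Sequence


class Example(NamedTuple):
    """One hypothetical training example (names are illustrative)."""
    prefix: tuple        # tokens seen by the forward side
    suffix: tuple        # tokens seen by the backward side
    next_token: object   # target: token right after the prefix
    prev_token: object   # target: token right before the suffix


def make_examples(tokens: Sequence) -> List[Example]:
    """Enumerate (prefix, suffix) pairs with their two prediction targets.

    Assumption: prefix = tokens[:i] and suffix = tokens[j:] with j > i,
    so at least one token lies between them. Either side may be empty.
    """
    examples = []
    n = len(tokens)
    for i in range(n):                  # prefix ends just before index i
        for j in range(i + 1, n + 1):   # suffix starts at index j
            examples.append(
                Example(
                    prefix=tuple(tokens[:i]),
                    suffix=tuple(tokens[j:]),
                    next_token=tokens[i],
                    prev_token=tokens[j - 1],
                )
            )
    return examples


if __name__ == "__main__":
    for ex in make_examples(["a", "b", "c"]):
        print(ex)
```

When the gap between prefix and suffix is exactly one token, both targets coincide; this is the fill-in-one-token case. In an actual model, both targets would be predicted from learned encodings of the prefix and suffix rather than enumerated as plain tuples.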
Related papers
- One-Layer Transformer Provably Learns One-Nearest Neighbor In Context [48.4979348643494]
We study the capability of one-layer transformers learning the one-nearest neighbor rule.
A single softmax attention layer can successfully learn to behave like a one-nearest neighbor.
arXiv Detail & Related papers (2024-11-16T16:12:42Z)
- Looking Beyond The Top-1: Transformers Determine Top Tokens In Order [13.032106683136394]
We analyze the computation performed by Transformers in the layers after the top-1 prediction has become fixed.
We find that these saturation events happen in order of the corresponding tokens' ranking.
We propose an underlying mechanism of task transition for this sequential saturation.
arXiv Detail & Related papers (2024-10-26T16:00:38Z)
- How do Transformers perform In-Context Autoregressive Learning? [76.18489638049545]
We train a Transformer model on a simple next token prediction task.
We show how a trained Transformer predicts the next token by first learning $W$ in-context, then applying a prediction mapping.
arXiv Detail & Related papers (2024-02-08T16:24:44Z)
- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings [68.61185138897312]
We show that a frozen transformer language model encodes strong positional information through the shrinkage of self-attention variance.
Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
arXiv Detail & Related papers (2023-05-23T01:03:40Z)
- Bird-Eye Transformers for Text Generation Models [49.47825106383972]
We propose a new architecture, called bird-eye transformer (BET), which goes one step further to improve the performance of transformers.
Our proposed model achieves better performance than the baseline transformer architectures on all datasets.
arXiv Detail & Related papers (2022-10-08T09:51:15Z)
- Position Prediction as an Effective Pretraining Strategy [20.925906203643883]
We propose a novel but surprisingly simple alternative to content reconstruction: predicting locations from content, without providing positional information for it.
Our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods.
arXiv Detail & Related papers (2022-07-15T17:10:48Z)
- XAI for Transformers: Better Explanations through Conservative Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction.
Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
arXiv Detail & Related papers (2022-02-15T10:47:11Z)
- Towards the Unseen: Iterative Text Recognition by Distilling from Errors [41.43280922432707]
Prior art mostly struggles with recognising unseen (or rarely seen) character sequences.
We put forward a novel framework to tackle this "unseen" problem.
Key to our success is a unique cross-modal variational autoencoder.
arXiv Detail & Related papers (2021-07-26T10:06:42Z)
- Transformer-F: A Transformer network with effective methods for learning universal sentence representation [8.225067988604351]
The Transformer model is widely used in natural language processing for sentence representation.
In this paper, two approaches are introduced to improve the performance of Transformers.
arXiv Detail & Related papers (2021-07-02T03:20:11Z)
- Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors [15.348047288817478]
We propose to use dictionary learning to open up "black boxes" as linear superpositions of transformer factors.
Through visualization, we demonstrate the hierarchical semantic structures captured by the transformer factors.
We hope this visualization tool can bring further knowledge and a better understanding of how transformer networks work.
arXiv Detail & Related papers (2021-03-29T20:51:33Z)
- Position Information in Transformers: An Overview [6.284464997330884]
This paper provides an overview of common methods to incorporate position information into Transformer models.
The objectives of this survey are to showcase that position information in Transformer is a vibrant and extensive research area.
arXiv Detail & Related papers (2021-02-22T15:03:23Z)
- Is Supervised Syntactic Parsing Beneficial for Language Understanding? An Empirical Investigation [71.70562795158625]
Traditional NLP has long held (supervised) syntactic parsing necessary for successful higher-level semantic language understanding (LU).
The recent advent of end-to-end neural models, self-supervised via language modeling (LM), and their success on a wide range of LU tasks call this belief into question.
We empirically investigate the usefulness of supervised parsing for semantic LU in the context of LM-pretrained transformer networks.
arXiv Detail & Related papers (2020-08-15T21:03:36Z)
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
arXiv Detail & Related papers (2020-04-30T17:38:27Z)
- Robustness Verification for Transformers [165.25112192811764]
We develop the first robustness verification algorithm for Transformers.
The certified robustness bounds computed by our method are significantly tighter than those obtained by naive Interval Bound Propagation.
These bounds also shed light on interpreting Transformers as they consistently reflect the importance of different words in sentiment analysis.
arXiv Detail & Related papers (2020-02-16T17:16:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.