The Belief State Transformer
- URL: http://arxiv.org/abs/2410.23506v2
- Date: Thu, 20 Feb 2025 04:44:32 GMT
- Title: The Belief State Transformer
- Authors: Edward S. Hu, Kwangjun Ahn, Qinghua Liu, Haoran Xu, Manan Tomar, Ada Langford, Dinesh Jayaraman, Alex Lamb, John Langford
- Abstract summary: The "Belief State Transformer" is a next-token predictor that takes both a prefix and a suffix as inputs.
It effectively learns to solve challenging problems that conventional forward-only transformers struggle with.
- Score: 50.196123952714245
- Abstract: We introduce the "Belief State Transformer", a next-token predictor that takes both a prefix and suffix as inputs, with a novel objective of predicting both the next token for the prefix and the previous token for the suffix. The Belief State Transformer effectively learns to solve challenging problems that conventional forward-only transformers struggle with, in a domain-independent fashion. Key to this success is learning a compact belief state that captures all relevant information necessary for accurate predictions. Empirical ablations show that each component of the model is essential in difficult scenarios where standard Transformers fall short. For the task of story writing with known prefixes and suffixes, our approach outperforms the Fill-in-the-Middle method for reaching known goals and demonstrates improved performance even when the goals are unknown. Altogether, the Belief State Transformer enables more efficient goal-conditioned decoding, better test-time inference, and high-quality text representations on small scale problems. Website: https://sites.google.com/view/belief-state-transformer
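Based only on the objective stated in the abstract, the following is a minimal PyTorch-style sketch: a forward encoder summarizes the prefix into a compact state, a backward encoder summarizes the suffix, and a shared head predicts both the next token for the prefix and the previous token for the suffix. Module names, layer choices, and dimensions are illustrative assumptions, not the authors' implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BeliefStateTransformerSketch(nn.Module):
    """Sketch of the bidirectional next/previous-token objective from the abstract."""

    def __init__(self, vocab_size: int, d_model: int = 256, n_layers: int = 4, n_heads: int = 4):
        super().__init__()
        self.vocab_size = vocab_size
        self.embed = nn.Embedding(vocab_size, d_model)  # positional encodings omitted for brevity

        def make_encoder():
            layer = nn.TransformerEncoderLayer(d_model, nhead=n_heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=n_layers)

        # Forward encoder summarizes the prefix; backward encoder summarizes the
        # suffix read right-to-left (approximated here by flipping the sequence).
        self.forward_enc = make_encoder()
        self.backward_enc = make_encoder()
        # Output head maps the (prefix state, suffix state) pair to two distributions:
        # the next token after the prefix and the previous token before the suffix.
        self.head = nn.Linear(2 * d_model, 2 * vocab_size)

    def forward(self, prefix: torch.Tensor, suffix: torch.Tensor):
        # prefix, suffix: (batch, length) token-id tensors
        prefix_state = self.forward_enc(self.embed(prefix))[:, -1]         # compact "belief state"
        suffix_state = self.backward_enc(self.embed(suffix.flip(1)))[:, -1]
        logits = self.head(torch.cat([prefix_state, suffix_state], dim=-1))
        next_logits, prev_logits = logits.split(self.vocab_size, dim=-1)
        return next_logits, prev_logits


def bst_loss(model, prefix, suffix, next_token, prev_token):
    # Supervise both predictions with targets taken from the same full sequence:
    # the token that follows the prefix and the token that precedes the suffix.
    next_logits, prev_logits = model(prefix, suffix)
    return F.cross_entropy(next_logits, next_token) + F.cross_entropy(prev_logits, prev_token)
```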
Related papers
- Looking Beyond The Top-1: Transformers Determine Top Tokens In Order [13.032106683136394]
We analyze the computation performed by Transformers in the layers after the top-1 prediction has become fixed.
We find that these saturation events, the points at which successive top-k predictions become fixed, happen in order of the corresponding tokens' ranking.
We propose an underlying mechanism of task transition for this sequential saturation.
arXiv Detail & Related papers (2024-10-26T16:00:38Z)
- How do Transformers perform In-Context Autoregressive Learning? [76.18489638049545]
We train a Transformer model on a simple next token prediction task.
We show how a trained Transformer predicts the next token by first learning the sequence's underlying transition matrix $W$ in-context and then applying a prediction mapping.
arXiv Detail & Related papers (2024-02-08T16:24:44Z)
- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings [68.61185138897312]
We show that a frozen transformer language model encodes strong positional information through the shrinkage of self-attention variance.
Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
arXiv Detail & Related papers (2023-05-23T01:03:40Z)
- Bird-Eye Transformers for Text Generation Models [49.47825106383972]
We propose a new architecture, called the bird-eye transformer (BET), which goes one step further to improve the performance of transformers.
Our proposed model achieves better performance than the baseline transformer architectures on all datasets.
arXiv Detail & Related papers (2022-10-08T09:51:15Z)
- Transformer-F: A Transformer network with effective methods for learning universal sentence representation [8.225067988604351]
The Transformer model is widely used in natural language processing for sentence representation.
In this paper, two approaches are introduced to improve the performance of Transformers.
arXiv Detail & Related papers (2021-07-02T03:20:11Z)
- Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors [15.348047288817478]
We propose to use dictionary learning to open up these "black boxes", modeling contextualized embeddings as linear superpositions of transformer factors.
Through visualization, we demonstrate the hierarchical semantic structures captured by the transformer factors.
We hope this visualization tool can bring further knowledge and a better understanding of how transformer networks work.
arXiv Detail & Related papers (2021-03-29T20:51:33Z)
- Position Information in Transformers: An Overview [6.284464997330884]
This paper provides an overview of common methods to incorporate position information into Transformer models.
The objective of this survey is to showcase that position information in Transformers is a vibrant and extensive research area.
arXiv Detail & Related papers (2021-02-22T15:03:23Z)
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
arXiv Detail & Related papers (2020-04-30T17:38:27Z)
- Robustness Verification for Transformers [165.25112192811764]
We develop the first robustness verification algorithm for Transformers.
The certified robustness bounds computed by our method are significantly tighter than those computed by naive Interval Bound Propagation.
These bounds also shed light on interpreting Transformers as they consistently reflect the importance of different words in sentiment analysis.
arXiv Detail & Related papers (2020-02-16T17:16:31Z)