Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
- URL: http://arxiv.org/abs/2311.04897v1
- Date: Wed, 8 Nov 2023 18:56:35 GMT
- Title: Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
- Authors: Koyena Pal, Jiuding Sun, Andrew Yuan, Byron C. Wallace, David Bau
- Abstract summary: We conjecture that hidden state vectors corresponding to individual input tokens encode information sufficient to accurately predict several tokens ahead.
We measure linear approximation and causal intervention methods in GPT-J-6B to evaluate the degree to which individual hidden states in the network contain signal rich enough to predict future hidden states and, ultimately, token outputs.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We conjecture that hidden state vectors corresponding to individual input
tokens encode information sufficient to accurately predict several tokens
ahead. More concretely, in this paper we ask: Given a hidden (internal)
representation of a single token at position $t$ in an input, can we reliably
anticipate the tokens that will appear at positions $\geq t + 2$? To test this,
we measure linear approximation and causal intervention methods in GPT-J-6B to
evaluate the degree to which individual hidden states in the network contain
signal rich enough to predict future hidden states and, ultimately, token
outputs. We find that, at some layers, we can approximate a model's output with
more than 48% accuracy with respect to its prediction of subsequent tokens
through a single hidden state. Finally we present a "Future Lens" visualization
that uses these methods to create a new view of transformer states.
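The linear-approximation idea above can be sketched as a learned probe that maps a single hidden state at position $t$ to logits for the token at $t+2$. The following toy example (all sizes, data, and the least-squares fit are illustrative assumptions, not the paper's actual probe) shows the shape of such a "future lens":

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab = 16, 50  # toy sizes; GPT-J-6B uses d_model=4096
n = 200                  # number of (hidden state, future token) pairs

# Toy training data: hidden states h_t and the token observed at t+2.
H = rng.normal(size=(n, d_model))
future_tokens = rng.integers(0, vocab, size=n)

# Fit a linear map W from hidden state to future-token logits by
# least squares against one-hot targets (for illustration only).
Y = np.eye(vocab)[future_tokens]
W, *_ = np.linalg.lstsq(H, Y, rcond=None)

def future_lens(h):
    """Predict the token two positions ahead from a single hidden state."""
    return int(np.argmax(h @ W))

# Sanity check: probe accuracy on its own toy training data.
preds = np.array([future_lens(h) for h in H])
train_acc = (preds == future_tokens).mean()
```

In practice such a probe would be trained on hidden states collected from real model runs, and evaluated layer by layer, since the paper finds that predictability varies with depth.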
Related papers
- Language Models Can Predict Their Own Behavior [28.80639362933004]
We show that internal representation of input tokens alone can often precisely predict, not just the next token, but eventual behavior over the entire output sequence.
We leverage this capacity and learn probes on internal states to create early warning (and exit) systems.
Specifically, if the probes can confidently estimate the way the LM is going to behave, then the system will avoid generating tokens altogether and return the estimated behavior instead.
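A probe-based early-warning/exit system of the kind described might look like this minimal sketch (the linear probe, softmax confidence rule, and threshold are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def maybe_early_exit(hidden_state, probe_weights, threshold=0.9):
    """If a linear probe on an internal state is confident about the
    model's eventual behavior, return that estimate and skip generation."""
    probs = softmax(hidden_state @ probe_weights)
    label = int(np.argmax(probs))
    if probs[label] >= threshold:
        return label  # exit early with the estimated behavior
    return None       # fall back to full token-by-token generation

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 3))   # toy probe over 3 behavior classes
h = rng.normal(size=8)        # toy internal state
decision = maybe_early_exit(h, w, threshold=0.0)  # threshold 0 always exits
```

The threshold trades off saved computation against the risk of returning a wrong behavior estimate.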
arXiv Detail & Related papers (2025-02-18T23:13:16Z)
- Improving Next Tokens via Second-to-Last Predictions with Generate and Refine [1.8592384822257952]
We train a decoder-only architecture to predict the second-to-last token of a sequence of tokens.
Our approach yields higher computational training efficiency than BERT-style models.
arXiv Detail & Related papers (2024-11-23T22:09:58Z)
- FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
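The multi-token decoding step can be sketched as predicting the intermediate hidden states of future positions and unembedding each of them (the linear future-state maps and all sizes here are illustrative assumptions, not FIRP's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, vocab, k = 8, 20, 3  # toy sizes; k tokens emitted per step

# Hypothetical learned maps, one per future position, predicting the
# intermediate hidden state i steps ahead from the current state.
future_maps = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(k)]
unembed = rng.normal(size=(d_model, vocab))

def decode_step(h):
    """Emit k candidate tokens from one forward pass by linearly
    predicting the hidden representations of future positions."""
    tokens = []
    for W in future_maps:
        h_future = h @ W  # predicted future intermediate state
        tokens.append(int(np.argmax(h_future @ unembed)))
    return tokens

h = rng.normal(size=d_model)
out = decode_step(h)  # k candidate tokens from a single step
```

In a real system the candidate tokens would typically be verified against the base model, as in speculative decoding, to keep generation lossless.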
arXiv Detail & Related papers (2024-10-27T15:53:49Z)
- ToSA: Token Selective Attention for Efficient Vision Transformers [50.13756218204456]
ToSA is a token-selective attention approach that identifies tokens that need to be attended to, as well as those that can skip a transformer layer.
We show that ToSA can significantly reduce computation costs while maintaining accuracy on the ImageNet classification benchmark.
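The token-selection step can be sketched as partitioning tokens by an importance score into an "attend" set and a "skip" set (the scores and keep ratio are illustrative assumptions; ToSA's actual selector is learned):

```python
import numpy as np

def select_tokens(scores, keep_ratio=0.5):
    """Split tokens into 'attend' and 'skip' index sets by importance
    score, keeping the top keep_ratio fraction for full attention."""
    n_keep = max(1, int(len(scores) * keep_ratio))
    order = np.argsort(scores)[::-1]       # indices, most important first
    attend = np.sort(order[:n_keep])       # tokens routed through attention
    skip = np.sort(order[n_keep:])         # tokens that bypass the layer
    return attend, skip

scores = np.array([0.9, 0.1, 0.7, 0.3])    # toy per-token importance
attend, skip = select_tokens(scores, keep_ratio=0.5)
```

Skipped tokens are carried through unchanged, so the layer's attention cost scales with the size of the attend set rather than the full sequence length.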
arXiv Detail & Related papers (2024-06-13T05:17:21Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely "hidden transfer", which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach to pose object recognition as next token prediction.
The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels.
arXiv Detail & Related papers (2023-12-04T18:58:40Z)
- Think before you speak: Training Language Models With Pause Tokens [73.61375226378712]
Language models generate responses by producing a series of tokens in immediate succession.
What if instead we let the model manipulate, say, $K+10$ hidden vectors before it outputs the $(K+1)$th token?
We operationalize this idea by performing training and inference on language models with a (learnable) "pause" token.
arXiv Detail & Related papers (2023-10-03T17:32:41Z)
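The pause-token mechanism can be sketched as appending extra tokens whose outputs are ignored, giving the model additional hidden-vector updates before it must commit to a real token (the token id and helper names below are hypothetical, for illustration only):

```python
PAUSE = -1  # hypothetical id for the learnable <pause> token

def with_pauses(prompt_ids, k=10):
    """Append k <pause> tokens so the model performs k extra hidden-vector
    updates before emitting its next real output token."""
    return prompt_ids + [PAUSE] * k

def decode_skipping_pauses(token_ids):
    """Strip pause tokens from a generated sequence before detokenizing,
    since their outputs are ignored by design."""
    return [t for t in token_ids if t != PAUSE]
```

In the paper's setup the pause token is inserted during both training and inference; applying it only at inference time gives much weaker gains.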
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.