Efficient Joint Prediction of Multiple Future Tokens
- URL: http://arxiv.org/abs/2503.21801v1
- Date: Mon, 24 Mar 2025 19:52:42 GMT
- Title: Efficient Joint Prediction of Multiple Future Tokens
- Authors: Kwangjun Ahn, Alex Lamb, John Langford
- Abstract summary: We introduce joint multi-token prediction (JTP), a lightweight modification of standard next-token prediction. Unlike previous multi-token prediction approaches, JTP strategically employs teacher forcing of future tokens. We show that the JTP approach achieves a short-horizon belief state representation, while popular alternatives for multi-token prediction fail to do so.
- Score: 20.647830092055955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this short report, we introduce joint multi-token prediction (JTP), a lightweight modification of standard next-token prediction designed to enrich hidden state representations by jointly predicting multiple future tokens. Unlike previous multi-token prediction approaches, JTP strategically employs teacher forcing of future tokens through a carefully designed representation bottleneck, allowing the model to encode rich predictive information with minimal computational overhead during training. We show that the JTP approach achieves a short-horizon belief state representation, while popular alternatives for multi-token prediction fail to do so. We demonstrate the effectiveness of our method on the synthetic star graph navigation task from Bachmann and Nagarajan [2024], highlighting a significant performance improvement over existing methods. This manuscript presents promising preliminary results intended to stimulate further research.
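The abstract describes the mechanism (teacher forcing of future tokens through a representation bottleneck) but not the architecture, so the snippet below is only a minimal PyTorch sketch of what such a head could look like; the `JTPHead` module, its GRU-based head, and all shapes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a joint multi-token prediction (JTP) head, inferred
# only from the abstract: the backbone hidden state passes through a small
# bottleneck, and the head is teacher-forced with ground-truth future tokens
# during training. Names and sizes are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JTPHead(nn.Module):
    def __init__(self, d_model: int, d_bottleneck: int, vocab_size: int, horizon: int):
        super().__init__()
        self.horizon = horizon
        # Representation bottleneck: predictive information about the next
        # `horizon` tokens must squeeze through this low-dimensional projection.
        self.bottleneck = nn.Linear(d_model, d_bottleneck)
        self.tok_emb = nn.Embedding(vocab_size, d_bottleneck)
        # Small recurrent head that consumes the teacher-forced future tokens.
        self.rnn = nn.GRU(d_bottleneck, d_bottleneck, batch_first=True)
        self.out = nn.Linear(d_bottleneck, vocab_size)

    def forward(self, hidden: torch.Tensor, future_tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (B, d_model) backbone state at position t
        # future_tokens: (B, horizon) ground-truth tokens x_{t+1..t+k} (teacher forcing)
        z = torch.tanh(self.bottleneck(hidden))                 # (B, d_bottleneck)
        shifted = self.tok_emb(future_tokens[:, :-1])           # (B, horizon-1, d)
        inputs = torch.cat([z.unsqueeze(1), shifted], dim=1)    # (B, horizon, d)
        out, _ = self.rnn(inputs, z.unsqueeze(0).contiguous())  # init state = bottleneck
        logits = self.out(out)                                  # (B, horizon, vocab)
        # Joint loss over the k future tokens, added to the usual next-token loss.
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               future_tokens.reshape(-1))
```

In this reading, the extra head would only shape the hidden states during training and could be dropped at inference, which is consistent with the abstract's claim of enriched representations at minimal training overhead.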
Related papers
- Meta-DAN: towards an efficient prediction strategy for page-level handwritten text recognition [4.605037293860087]
We propose the Meta Document Attention Network (Meta-DAN) as a novel decoding strategy to reduce the prediction time.
We evaluate the proposed approach on 10 full-page handwritten datasets and demonstrate state-of-the-art results on average in terms of character error rate.
arXiv Detail & Related papers (2025-04-04T11:06:09Z) - Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE [15.003006630308517]
Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens. We propose Jakiro, leveraging Mixture of Experts (MoE), where independent experts generate diverse predictions. Our method significantly boosts prediction accuracy and achieves higher inference speedups.
arXiv Detail & Related papers (2025-02-10T09:24:06Z) - Improving Next Tokens via Second-to-Last Predictions with Generate and Refine [1.8592384822257952]
We train a decoder-only architecture for predicting the second-to-last token for a sequence of tokens. Our approach yields higher computational training efficiency than BERT-style models.
arXiv Detail & Related papers (2024-11-23T22:09:58Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - Future Token Prediction -- Causal Language Modelling with Per-Token Semantic State Vector for Multi-Token Prediction [0.0]
This research investigates a new pretraining method called Future Token Prediction (FTP).
FTP generates embedding vectors for each token position that are linearly and expansively projected to a pseudo-sequence.
On a toy, but complex, coding problem, FTP networks produce significantly better results than GPT networks.
arXiv Detail & Related papers (2024-10-23T14:50:15Z) - Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion [61.03681839276652]
Diffusion Forcing is a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens.
arXiv Detail & Related papers (2024-07-01T15:43:25Z) - TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
In conjunction with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z) - Computational Tradeoffs in Image Synthesis: Diffusion, Masked-Token, and Next-Token Prediction [79.78050867137594]
Diffusion, masked-token prediction, and next-token prediction all use a Transformer network architecture.
We analyze the scalability of each approach through the lens of compute budget measured in FLOPs.
We find that token prediction methods, led by next-token prediction, significantly outperform diffusion on prompt following.
arXiv Detail & Related papers (2024-05-21T21:49:39Z) - Aligned Contrastive Predictive Coding [10.521845940927163]
We investigate the possibility of forcing a self-supervised model trained using a contrastive predictive loss to extract slowly varying latent representations.
Rather than producing individual predictions for each of the future representations, the model emits a sequence of predictions shorter than that of the upcoming representations to which they will be aligned.
arXiv Detail & Related papers (2021-04-24T13:07:22Z) - Ambiguity in Sequential Data: Predicting Uncertain Futures with
Recurrent Models [110.82452096672182]
We propose an extension of the Multiple Hypothesis Prediction (MHP) model to handle ambiguous predictions with sequential data.
We also introduce a novel metric for ambiguous problems, which is better suited to account for uncertainties.
arXiv Detail & Related papers (2020-03-10T09:15:42Z) - ProphetNet: Predicting Future N-gram for Sequence-to-Sequence
Pre-training [85.35910219651572]
We present a new sequence-to-sequence pre-training model called ProphetNet.
It introduces a novel self-supervised objective named future n-gram prediction.
We conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks.
arXiv Detail & Related papers (2020-01-13T05:12:38Z)