Efficient Joint Prediction of Multiple Future Tokens
- URL: http://arxiv.org/abs/2503.21801v1
- Date: Mon, 24 Mar 2025 19:52:42 GMT
- Title: Efficient Joint Prediction of Multiple Future Tokens
- Authors: Kwangjun Ahn, Alex Lamb, John Langford
- Abstract summary: We introduce joint multi-token prediction (JTP), a lightweight modification of standard next-token prediction. Unlike previous multi-token prediction approaches, JTP strategically employs teacher forcing of future tokens. We show that the JTP approach achieves a short-horizon belief state representation, while popular alternatives for multi-token prediction fail to do so.
- Score: 20.647830092055955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this short report, we introduce joint multi-token prediction (JTP), a lightweight modification of standard next-token prediction designed to enrich hidden state representations by jointly predicting multiple future tokens. Unlike previous multi-token prediction approaches, JTP strategically employs teacher forcing of future tokens through a carefully designed representation bottleneck, allowing the model to encode rich predictive information with minimal computational overhead during training. We show that the JTP approach achieves a short-horizon belief state representation, while popular alternatives for multi-token prediction fail to do so. We demonstrate the effectiveness of our method on the synthetic star graph navigation task from Bachmann and Nagarajan [2024], highlighting a significant performance improvement over existing methods. This manuscript presents promising preliminary results intended to stimulate further research.
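The abstract describes the mechanism (teacher forcing of future tokens through a representation bottleneck) but not the architecture, so the snippet below is only a minimal PyTorch sketch of what such a head could look like; the `JTPHead` module, its GRU-based head, and all shapes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a joint multi-token prediction (JTP) head, inferred
# only from the abstract: the backbone hidden state passes through a small
# bottleneck, and the head is teacher-forced with ground-truth future tokens
# during training. Names and sizes are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JTPHead(nn.Module):
    def __init__(self, d_model: int, d_bottleneck: int, vocab_size: int, horizon: int):
        super().__init__()
        self.horizon = horizon
        # Representation bottleneck: predictive information about the next
        # `horizon` tokens must squeeze through this low-dimensional projection.
        self.bottleneck = nn.Linear(d_model, d_bottleneck)
        self.tok_emb = nn.Embedding(vocab_size, d_bottleneck)
        # Small recurrent head that consumes the teacher-forced future tokens.
        self.rnn = nn.GRU(d_bottleneck, d_bottleneck, batch_first=True)
        self.out = nn.Linear(d_bottleneck, vocab_size)

    def forward(self, hidden: torch.Tensor, future_tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (B, d_model) backbone state at position t
        # future_tokens: (B, horizon) ground-truth tokens x_{t+1..t+k} (teacher forcing)
        z = torch.tanh(self.bottleneck(hidden))                 # (B, d_bottleneck)
        shifted = self.tok_emb(future_tokens[:, :-1])           # (B, horizon-1, d)
        inputs = torch.cat([z.unsqueeze(1), shifted], dim=1)    # (B, horizon, d)
        out, _ = self.rnn(inputs, z.unsqueeze(0).contiguous())  # init state = bottleneck
        logits = self.out(out)                                  # (B, horizon, vocab)
        # Joint loss over the k future tokens, added to the usual next-token loss.
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               future_tokens.reshape(-1))
```

In this reading, the extra head would only shape the hidden states during training and could be dropped at inference, which is consistent with the abstract's claim of enriched representations at minimal training overhead.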
Related papers
- Meta-DAN: towards an efficient prediction strategy for page-level handwritten text recognition [4.605037293860087]
We propose the Meta Document Attention Network (Meta-DAN) as a novel decoding strategy to reduce the prediction time.
We evaluate the proposed approach on 10 full-page handwritten datasets and demonstrate state-of-the-art results on average in terms of character error rate.
arXiv Detail & Related papers (2025-04-04T11:06:09Z) - Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE [15.003006630308517]
Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens. We propose Jakiro, leveraging Mixture of Experts (MoE), where independent experts generate diverse predictions. Our method significantly boosts prediction accuracy and achieves higher inference speedups.
arXiv Detail & Related papers (2025-02-10T09:24:06Z) - Improving Next Tokens via Second-to-Last Predictions with Generate and Refine [1.8592384822257952]
We train a decoder-only architecture for predicting the second-to-last token for a sequence of tokens. Our approach yields higher computational training efficiency than BERT-style models.
arXiv Detail & Related papers (2024-11-23T22:09:58Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - Future Token Prediction -- Causal Language Modelling with Per-Token Semantic State Vector for Multi-Token Prediction [0.0]
This research investigates a new pretraining method called Future Token Prediction (FTP).
FTP generates embedding vectors for each token position that are linearly and expansively projected to a pseudo-sequence.
On a toy, but complex, coding problem, FTP networks produce significantly better results than GPT networks.
arXiv Detail & Related papers (2024-10-23T14:50:15Z) - Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion [61.03681839276652]
Diffusion Forcing is a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens.
arXiv Detail & Related papers (2024-07-01T15:43:25Z) - TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
In conjunction with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z) - Computational Tradeoffs in Image Synthesis: Diffusion, Masked-Token, and Next-Token Prediction [79.78050867137594]
Diffusion, masked-token prediction, and next-token prediction all use a Transformer network architecture.
We analyze the scalability of each approach through the lens of compute budget measured in FLOPs.
We find that token prediction methods, led by next-token prediction, significantly outperform diffusion on prompt following.
arXiv Detail & Related papers (2024-05-21T21:49:39Z) - Aligned Contrastive Predictive Coding [10.521845940927163]
We investigate the possibility of forcing a self-supervised model trained using a contrastive predictive loss to extract slowly varying latent representations.
Rather than producing individual predictions for each of the future representations, the model emits a sequence of predictions shorter than that of the upcoming representations to which they will be aligned.
arXiv Detail & Related papers (2021-04-24T13:07:22Z) - Ambiguity in Sequential Data: Predicting Uncertain Futures with
Recurrent Models [110.82452096672182]
We propose an extension of the Multiple Hypothesis Prediction (MHP) model to handle ambiguous predictions with sequential data.
We also introduce a novel metric for ambiguous problems, which is better suited to account for uncertainties.
arXiv Detail & Related papers (2020-03-10T09:15:42Z) - ProphetNet: Predicting Future N-gram for Sequence-to-Sequence
Pre-training [85.35910219651572]
We present a new sequence-to-sequence pre-training model called ProphetNet.
It introduces a novel self-supervised objective named future n-gram prediction.
We conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks.
arXiv Detail & Related papers (2020-01-13T05:12:38Z)