Improving Next Tokens via Second-Last Predictions with Generate and Refine
- URL: http://arxiv.org/abs/2411.15661v1
- Date: Sat, 23 Nov 2024 22:09:58 GMT
- Title: Improving Next Tokens via Second-Last Predictions with Generate and Refine
- Authors: Johannes Schneider
- Abstract summary: We train a decoder only architecture for predicting the second last token for a sequence of tokens.
Our approach yields higher computational training efficiency than BERT-style models.
- Score: 1.8592384822257952
- Abstract: Autoregressive language models like GPT are trained to predict the next token, while autoencoding models such as BERT are trained on tasks such as predicting masked tokens. We train a decoder-only architecture to predict the second-to-last token of a sequence of tokens. Our approach yields higher computational training efficiency than BERT-style models by employing a structured, deterministic approach to masking tokens. We use our model to improve the next-token predictions of a standard GPT by combining both predictions in a "generate-then-refine" approach. We show on different variants of GPT-2 and different datasets that (not unexpectedly) second-to-last token predictions are much more accurate, i.e., more than 15% higher accuracy than that of ordinary next-token predictors. The "generate-then-refine" approach also improves next-token predictions, yielding smaller yet consistent and significant gains.
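A rough sketch of how such a generate-then-refine step might be wired, assuming the refinement model re-scores each candidate next token and the two scores are mixed log-linearly; the `refine_fn` interface, the mixing rule, and all names are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of a "generate-then-refine" decoding step. A GPT-style
# next-token distribution is re-scored by a second model that judges each
# candidate token; the log-linear mixing and the refine_fn interface are
# assumptions, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def generate_then_refine(base_logits, refine_fn, context, top_k=20, alpha=0.5):
    """base_logits: (vocab,) next-token logits from the generator.
    refine_fn(context, candidate) -> scalar log-score for the candidate,
    standing in for the trained second-to-last-token predictor."""
    log_p_base = F.log_softmax(base_logits, dim=-1)
    top_scores, top_ids = log_p_base.topk(top_k)            # re-score only likely candidates
    refine_scores = torch.stack([refine_fn(context, int(t)) for t in top_ids])
    combined = (1 - alpha) * top_scores + alpha * refine_scores
    return int(top_ids[combined.argmax()])                  # refined next-token choice

# Toy usage: a random scorer stands in for the refinement model.
dummy_refiner = lambda ctx, tok: torch.randn(())
print(generate_then_refine(torch.randn(100), dummy_refiner, context=[5, 7, 42]))
```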
Related papers
- FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing speedups of 1.9x-3x across several models and datasets.
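A highly simplified sketch of the general idea, assuming it amounts to predicting pseudo hidden states for a few future positions from an intermediate-layer state and decoding them in one pass; the module choices and shapes below are illustrative, not FIRP's implementation.

```python
# Highly simplified sketch of predicting future intermediate representations:
# from one intermediate hidden state, linearly predict pseudo hidden states for
# k future positions, refine them with the remaining layers, and decode k draft
# tokens in a single step. Shapes and modules are illustrative assumptions.
import torch
import torch.nn as nn

class FutureStatePredictor(nn.Module):
    def __init__(self, d_model=64, k_future=3, vocab=1000):
        super().__init__()
        self.proj = nn.Linear(d_model, k_future * d_model)   # predict k pseudo states
        self.upper = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab)
        self.k, self.d = k_future, d_model

    def forward(self, h_mid):                     # (batch, d_model) intermediate-layer state
        pseudo = self.proj(h_mid).view(-1, self.k, self.d)
        refined = self.upper(pseudo)              # stand-in for the remaining decoder layers
        return self.lm_head(refined).argmax(-1)   # k draft token ids per input

print(FutureStatePredictor()(torch.randn(2, 64)))  # tensor of shape (2, 3)
```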
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - Future Token Prediction -- Causal Language Modelling with Per-Token Semantic State Vector for Multi-Token Prediction [0.0]
This research investigates a new pretraining method called Future Token Prediction (FTP).
FTP generates embedding vectors for each token position that are linearly and expansively projected to a pseudo-sequence.
On a toy, but complex, coding problem, FTP networks produce significantly better results than GPT networks.
arXiv Detail & Related papers (2024-10-23T14:50:15Z) - Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition [5.575078692353885]
We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy.
By generalizing this formulation to a rank-r canonical probability decomposition, we develop an improved model that predicts multiple tokens simultaneously.
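A minimal sketch of what a rank-r canonical ("CP-style") decomposition over k future tokens could look like, based only on the summary above: a mixture of r components, each factorizing across positions. The module names, shapes, and the per-position marginals returned at the end are illustrative assumptions.

```python
# Minimal sketch of a rank-r canonical decomposition of the joint distribution
# over k future tokens: a mixture of r components, each factorizing across
# positions. Names and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RankRMultiTokenHead(nn.Module):
    def __init__(self, d_model=64, vocab=1000, k=4, rank=8):
        super().__init__()
        self.mix = nn.Linear(d_model, rank)                # mixture weights w_r(x)
        self.heads = nn.Linear(d_model, rank * k * vocab)  # per-component, per-position logits
        self.k, self.rank, self.vocab = k, rank, vocab

    def forward(self, h):                                   # h: (batch, d_model)
        w = F.softmax(self.mix(h), dim=-1)                  # (batch, rank)
        logits = self.heads(h).view(-1, self.rank, self.k, self.vocab)
        p = F.softmax(logits, dim=-1)                       # factorized within each component
        # joint: sum_r w_r * prod_j p_{r,j}(t_j); for brevity we return the
        # per-position marginals of that mixture
        return torch.einsum("br,brkv->bkv", w, p)

print(RankRMultiTokenHead()(torch.randn(2, 64)).shape)      # (2, 4, 1000); each row sums to 1
```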
arXiv Detail & Related papers (2024-10-23T11:06:36Z) - TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
In conjunction with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
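A minimal sketch of how the three objectives named above could be mixed into one training loss; the form of the "next-all" term and the equal weights are assumptions, not TokenUnify's actual definition.

```python
# Sketch of mixing the three objectives into a single loss. How TokenUnify
# actually defines and weights each term is not given here; in particular the
# "next-all" term below (re-using one head against every later offset) is an
# assumed stand-in.
import torch
import torch.nn.functional as F

def tokenunify_style_loss(logits, targets, mask, weights=(1.0, 1.0, 1.0)):
    """logits: (batch, seq, vocab); targets: (batch, seq) ids; mask: (batch, seq) bool."""
    _, s, v = logits.shape
    # 1) next-token prediction: position t is scored against token t+1
    next_loss = F.cross_entropy(logits[:, :-1].reshape(-1, v), targets[:, 1:].reshape(-1))
    # 2) random-token prediction: cross-entropy only at the randomly masked positions
    rand_loss = F.cross_entropy(logits[mask], targets[mask])
    # 3) assumed "next-all" term: every later offset is also scored
    all_loss = sum(
        F.cross_entropy(logits[:, :-h].reshape(-1, v), targets[:, h:].reshape(-1))
        for h in range(2, s)
    ) / max(s - 2, 1)
    w1, w2, w3 = weights
    return w1 * next_loss + w2 * rand_loss + w3 * all_loss

mask = torch.zeros(2, 16, dtype=torch.bool)
mask[:, ::5] = True                                         # pretend a few positions were masked
print(tokenunify_style_loss(torch.randn(2, 16, 1000), torch.randint(0, 1000, (2, 16)), mask))
```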
arXiv Detail & Related papers (2024-05-27T05:45:51Z) - Computational Tradeoffs in Image Synthesis: Diffusion, Masked-Token, and Next-Token Prediction [79.78050867137594]
Diffusion, masked-token prediction, and next-token prediction all use a Transformer network architecture.
We analyze the scalability of each approach through the lens of compute budget measured in FLOPs.
We find that token prediction methods, led by next-token prediction, significantly outperform diffusion on prompt following.
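For scale, a common back-of-the-envelope estimate of training compute for a dense transformer is roughly 6 * N * D FLOPs for N parameters trained on D tokens; this rule of thumb is an assumption for context, not the paper's own accounting.

```python
# Rough training-compute estimate (standard ~6*N*D rule of thumb), for context
# only; not the paper's exact FLOPs accounting.
def approx_training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

print(f"{approx_training_flops(1e9, 2e10):.1e} FLOPs")  # ~1.2e+20 for a 1B model on 20B tokens
```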
arXiv Detail & Related papers (2024-05-21T21:49:39Z) - Better & Faster Large Language Models via Multi-token Prediction [29.067271500844928]
Large language models such as GPT and Llama are trained with a next-token prediction loss.
We suggest that training language models to predict multiple future tokens at once results in higher sample efficiency.
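A minimal sketch of one common way to realize multi-token prediction as summarized above: a shared trunk feeds several small heads, head i predicting the token i steps ahead, with the per-head losses summed. The sizes and plain linear heads are illustrative assumptions.

```python
# Sketch of multi-token prediction with n output heads over a shared trunk;
# head i is trained on the token i steps ahead and the losses are summed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    def __init__(self, d_model=64, vocab=1000, n_future=4):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(n_future)])

    def loss(self, trunk_h, targets):             # trunk_h: (batch, seq, d); targets: (batch, seq)
        total = 0.0
        for i, head in enumerate(self.heads, start=1):
            logits = head(trunk_h[:, :-i])         # keep only positions with a token i steps ahead
            total = total + F.cross_entropy(logits.flatten(0, 1), targets[:, i:].flatten())
        return total

print(MultiTokenHeads().loss(torch.randn(2, 16, 64), torch.randint(0, 1000, (2, 16))))
```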
arXiv Detail & Related papers (2024-04-30T17:33:57Z) - Exploring the Role of Token in Transformer-based Time Series Forecasting [10.081240480138487]
Transformer-based methods are a mainstream approach to time series forecasting (TSF).
Most work focuses on optimizing the model structure, with few studies paying attention to the role tokens play in predictions.
We find that the gradients mainly depend on tokens that contribute to the predicted series, called positive tokens.
To utilize T-PE and V-PE, we propose T2B-PE, a Transformer-based dual-branch framework.
arXiv Detail & Related papers (2024-04-16T07:21:39Z) - Object Recognition as Next Token Prediction [99.40793702627396]
We present an approach that poses object recognition as next-token prediction.
The idea is to apply a language decoder that autoregressively predicts text tokens from image embeddings to form labels.
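A toy sketch of that recipe: image embeddings act as the memory of a small text decoder that greedily emits label tokens. The decoder size, vocabulary, BOS id, and greedy loop are assumptions for illustration only.

```python
# Toy sketch: a text decoder cross-attends to image embeddings and greedily
# emits label tokens. All sizes and ids here are illustrative assumptions.
import torch
import torch.nn as nn

d, vocab, bos = 64, 1000, 0
decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d, nhead=4, batch_first=True), num_layers=2)
tok_emb, lm_head = nn.Embedding(vocab, d), nn.Linear(d, vocab)

def recognize(image_embeddings, max_len=5):        # image_embeddings: (1, n_patches, d)
    tokens = [bos]
    for _ in range(max_len):
        tgt = tok_emb(torch.tensor([tokens]))       # (1, t, d) label tokens decoded so far
        h = decoder(tgt, memory=image_embeddings)   # cross-attend to the image
        tokens.append(int(lm_head(h[:, -1]).argmax(-1)))
    return tokens[1:]                               # predicted label token ids

print(recognize(torch.randn(1, 9, d)))
```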
arXiv Detail & Related papers (2023-12-04T18:58:40Z) - Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
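A simplified sketch of a non-autoregressive layout in this spirit: learned position-dependent queries summarize encoder features into a fixed number of token slots, which are emitted in parallel. All modules and sizes below are assumptions, not LASO's actual components.

```python
# Sketch of a non-autoregressive recognizer: learned positional queries pool
# the encoded audio into token slots that are decoded all at once.
import torch
import torch.nn as nn

class TinyNonAutoregressiveASR(nn.Module):
    def __init__(self, d=64, max_tokens=16, vocab=1000):
        super().__init__()
        self.encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), 2)
        self.queries = nn.Parameter(torch.randn(max_tokens, d))   # position-dependent token slots
        self.summarize = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.out = nn.Linear(d, vocab)

    def forward(self, acoustic):                                   # (batch, frames, d)
        mem = self.encoder(acoustic)
        q = self.queries.expand(acoustic.size(0), -1, -1)          # one set of slots per utterance
        slots, _ = self.summarize(q, mem, mem)                     # each slot attends over the audio
        return self.out(slots).argmax(-1)                          # all tokens emitted at once

print(TinyNonAutoregressiveASR()(torch.randn(2, 50, 64)).shape)    # (2, 16)
```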
arXiv Detail & Related papers (2021-02-15T15:18:59Z) - ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training [85.35910219651572]
We present a new sequence-to-sequence pre-training model called ProphetNet.
It introduces a novel self-supervised objective named future n-gram prediction.
We conduct experiments on CNN/DailyMail, Gigaword, and SQuAD 1.1 benchmarks for abstractive summarization and question generation tasks.
arXiv Detail & Related papers (2020-01-13T05:12:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.