Enhancing next token prediction based pre-training for jet foundation models
- URL: http://arxiv.org/abs/2512.04149v1
- Date: Wed, 03 Dec 2025 19:00:00 GMT
- Title: Enhancing next token prediction based pre-training for jet foundation models
- Authors: Joschka Birk, Anna Hallin, Gregor Kasieczka, Nikol Madzharova, Ian Pang, David Shih
- Abstract summary: Next token prediction is an attractive pre-training task for jet foundation models. It is simulation free and enables excellent generative capabilities that can transfer across datasets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Next token prediction is an attractive pre-training task for jet foundation models, in that it is simulation free and enables excellent generative capabilities that can transfer across datasets. Here we study multiple improvements to next token prediction, building on the initial work of OmniJet-$\alpha$. First, instead of tokenizing particles and subsequently using only the token-ID as the model input for both the generative and the classification task, we adopt a hybrid setup, which allows us to use continuous feature vectors as model input while using token-IDs only in the next token prediction target. Second, we explore a pre-training strategy that combines masked particle modeling and generative learning objectives. Taken together, these changes greatly improve the performance in downstream classification tasks without any loss in generative performance.
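To make the two changes described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch (not the OmniJet-$\alpha$ implementation): continuous particle features are projected directly into the backbone, token-IDs from an assumed, separately trained tokenizer appear only as classification targets, and the next-token-prediction loss is summed with a masked-particle-modeling loss. All module names, shapes, and the shared-backbone layout are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of the two ideas in the abstract:
#   1) "hybrid" inputs: continuous particle features go into the backbone,
#      while discrete token-IDs are used only as prediction targets;
#   2) a combined objective: next-token prediction + masked particle modeling.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridJetBackbone(nn.Module):
    """Continuous particle features go in; token-IDs are only used as targets."""

    def __init__(self, n_features=3, d_model=128, n_tokens=8192, n_layers=4, n_heads=8):
        super().__init__()
        # Project continuous features (e.g. pT, eta, phi) instead of looking up
        # an embedding for a discretized token-ID.
        self.input_proj = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Both objectives are classifications over the tokenizer codebook.
        self.next_token_head = nn.Linear(d_model, n_tokens)
        self.mpm_head = nn.Linear(d_model, n_tokens)
        self.mask_embedding = nn.Parameter(torch.zeros(d_model))

    def forward(self, features, mpm_mask=None, causal=True):
        x = self.input_proj(features)  # (B, N, d_model)
        if mpm_mask is not None:
            # Replace masked particles with a learned mask embedding (MPM branch).
            x = torch.where(mpm_mask.unsqueeze(-1), self.mask_embedding.expand_as(x), x)
        attn_mask = None
        if causal:
            n = x.size(1)
            attn_mask = torch.triu(torch.full((n, n), float("-inf"), device=x.device), diagonal=1)
        return self.encoder(x, mask=attn_mask)


def combined_pretraining_loss(model, features, token_ids, mask_frac=0.15):
    """token_ids: IDs from a (separately trained) tokenizer, used only as targets."""
    B, N, _ = features.shape
    # Next-token prediction: position i predicts the token-ID of particle i+1.
    h = model(features)
    ntp_logits = model.next_token_head(h[:, :-1])
    ntp_loss = F.cross_entropy(ntp_logits.reshape(-1, ntp_logits.size(-1)),
                               token_ids[:, 1:].reshape(-1))
    # Masked particle modeling: recover the token-IDs of randomly masked particles,
    # here with bidirectional attention (causal=False).
    mpm_mask = torch.rand(B, N, device=features.device) < mask_frac
    h_masked = model(features, mpm_mask=mpm_mask, causal=False)
    mpm_logits = model.mpm_head(h_masked)[mpm_mask]
    mpm_loss = F.cross_entropy(mpm_logits, token_ids[mpm_mask])
    return ntp_loss + mpm_loss
```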
Related papers
- Multi-Token Prediction via Self-Distillation [73.81494481537636]
We consider a new approach for converting a pretrained autoregressive language model from a slow single next-token prediction model into a fast standalone multi-token prediction model. On GSM8K, our method produces models that can decode more than $3\times$ faster on average at a $5\%$ drop in accuracy relative to single-token decoding performance.
arXiv Detail & Related papers (2026-02-05T18:54:48Z)
- Next-Embedding Prediction Makes Strong Vision Learners [68.55755328850634]
We train models to generate embeddings to perform predictive tasks directly. Next-Embedding Predictive Autoregression (NEPA) achieves strong results across tasks. We believe generative pretraining from embeddings provides a simple, scalable, and potentially modality-agnostic alternative to visual self-supervised learning.
arXiv Detail & Related papers (2025-12-18T18:59:58Z)
- Quadratic Direct Forecast for Training Multi-Step Time-Series Forecast Models [88.18038107198218]
Existing training objectives mostly treat each future step as an independent, equally weighted task. We propose a novel quadratic-form weighted training objective, addressing both issues simultaneously. Experiments show that our QDF effectively improves the performance of various forecast models.
arXiv Detail & Related papers (2025-10-28T14:48:25Z)
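The entry above only names the objective, so the following is one plausible minimal reading (an assumption on our part, not the QDF formulation): the per-horizon forecast errors are coupled through a quadratic form with a weight matrix, rather than summed as independent, equally weighted terms.

```python
# One plausible minimal form of a quadratic weighted multi-step objective
# (an assumption, not necessarily the paper's formulation).
import torch

def quadratic_form_loss(pred, target, W):
    # pred, target: (batch, horizon); W: (horizon, horizon), symmetric PSD.
    e = pred - target                                   # per-step errors
    return torch.einsum("bi,ij,bj->b", e, W, e).mean()  # e^T W e, averaged over the batch
```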
- Text Generation Beyond Discrete Token Sampling [74.06071135207635]
Mixture of Inputs (MoI) is a training-free method for autoregressive generation. MoI consistently improves performance across multiple models including QwQ-32B, Nemotron-Super-49B, Gemma-3-27B, and DAPO-Qwen-32B.
arXiv Detail & Related papers (2025-05-20T18:41:46Z)
- Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition [5.575078692353885]
We propose a new model for multi-token prediction in transformers, aiming to enhance sampling efficiency without compromising accuracy. By generalizing it to a rank-$r$ canonical probability decomposition, we develop an improved model that predicts multiple tokens simultaneously.
arXiv Detail & Related papers (2024-10-23T11:06:36Z)
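A rank-$r$ canonical (CP) decomposition of the joint distribution over the next $k$ tokens can be written as $p(t_1,\dots,t_k\mid x) \approx \sum_{c=1}^{r} w_c(x)\prod_{j=1}^{k} p_{c,j}(t_j\mid x)$, i.e. a mixture of $r$ rank-one product distributions. The sketch below is a hypothetical illustration of such a head, not the paper's architecture; the head layout and sampling scheme are assumptions.

```python
# Hypothetical sketch of a rank-r CP-decomposed multi-token prediction head.
import torch
import torch.nn as nn


class RankRMultiTokenHead(nn.Module):
    def __init__(self, d_model, vocab_size, k=4, rank=8):
        super().__init__()
        self.k, self.rank, self.vocab_size = k, rank, vocab_size
        self.mix_weights = nn.Linear(d_model, rank)              # w_c(x)
        # One factor distribution per (component, position): p_{c,j}(t_j | x).
        self.factors = nn.Linear(d_model, rank * k * vocab_size)

    def forward(self, h):
        # h: (batch, d_model) hidden state at the current position.
        b = h.size(0)
        logw = torch.log_softmax(self.mix_weights(h), dim=-1)    # (b, r)
        logits = self.factors(h).view(b, self.rank, self.k, self.vocab_size)
        logp = torch.log_softmax(logits, dim=-1)                 # (b, r, k, V)
        return logw, logp

    def sample(self, h):
        logw, logp = self.forward(h)
        # Pick a mixture component, then sample the k positions independently.
        c = torch.distributions.Categorical(logits=logw).sample()        # (b,)
        comp = logp[torch.arange(h.size(0)), c]                          # (b, k, V)
        return torch.distributions.Categorical(logits=comp).sample()     # (b, k)
```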
- Bridging the Training-Inference Gap in LLMs by Leveraging Self-Generated Tokens [45.745443096804586]
Language models are often trained to maximize the likelihood of the next token given past tokens in the training dataset. During inference time, they are utilized differently, generating text sequentially and auto-regressively by using previously generated tokens as input to predict the next one. This paper proposes two simple approaches based on the model's own generation to address this discrepancy between training and inference time.
arXiv Detail & Related papers (2024-10-18T17:48:27Z)
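The summary above does not spell out the paper's two approaches, so the snippet below only illustrates the general idea of mixing the model's own generations into the training inputs, in the spirit of scheduled sampling; the mixing rule, probability, and the HuggingFace-style `.logits` output are all assumptions.

```python
# Illustrative sketch only: mix self-generated tokens into the training inputs
# (scheduled-sampling-style), reducing the train/inference mismatch described above.
import torch

def mix_in_self_generated(input_ids, model, self_gen_prob=0.25):
    """Replace a random subset of ground-truth input tokens with the model's own
    greedy predictions before computing the usual next-token loss."""
    with torch.no_grad():
        logits = model(input_ids).logits                 # (batch, seq, vocab); HF-style output assumed
        self_pred = logits.argmax(dim=-1)                # model's own next-token guesses
    # Position i+1 can be replaced by the model's prediction made at position i.
    replace = torch.rand_like(input_ids[:, 1:], dtype=torch.float) < self_gen_prob
    mixed = input_ids.clone()
    mixed[:, 1:] = torch.where(replace, self_pred[:, :-1], input_ids[:, 1:])
    return mixed  # train on (mixed inputs, original targets) as usual
```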
- Is Tokenization Needed for Masked Particle Modelling? [8.79008927474707]
Masked particle modeling (MPM) is a self-supervised learning scheme for constructing expressive representations of unordered sets.
We improve MPM by addressing inefficiencies in the implementation and incorporating a more powerful decoder.
We show that these new methods outperform the tokenized learning objective from the original MPM on a new test bed for foundation models for jets.
arXiv Detail & Related papers (2024-09-19T09:12:29Z)
- Semformer: Transformer Language Models with Semantic Planning [18.750863564495006]
Next-token prediction serves as the dominant component in current neural language models.
We introduce Semformer, a novel method of training a Transformer language model that explicitly models the semantic planning of the response.
arXiv Detail & Related papers (2024-09-17T12:54:34Z)
- TokenUnify: Scaling Up Autoregressive Pretraining for Neuron Segmentation [65.65530016765615]
We propose a hierarchical predictive coding framework that captures multi-scale dependencies through three complementary learning objectives. TokenUnify integrates random token prediction, next-token prediction, and next-all token prediction to create a comprehensive representational space. We also introduce a large-scale EM dataset with 1.2 billion annotated voxels, offering ideal long-sequence visual data with spatial continuity.
arXiv Detail & Related papers (2024-05-27T05:45:51Z)
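A rough sketch (not the TokenUnify code) of combining the three objectives named in the entry above: random (masked) token prediction, next-token prediction, and "next-all" token prediction. How "next-all" is realized here, with each position penalized against every remaining token, is our assumption for illustration.

```python
# Hypothetical combination of the three objectives listed in the entry above.
import torch
import torch.nn.functional as F

def token_unify_style_loss(logits, targets, mask):
    """
    logits:  (batch, seq, vocab) per-position predictions from the backbone
    targets: (batch, seq) ground-truth token ids
    mask:    (batch, seq) bool, positions randomly masked at the input
    """
    V = logits.size(-1)
    # 1) random token prediction: recover the masked positions.
    l_rand = F.cross_entropy(logits[mask], targets[mask])
    # 2) next-token prediction: position i predicts token i+1.
    l_next = F.cross_entropy(logits[:, :-1].reshape(-1, V), targets[:, 1:].reshape(-1))
    # 3) "next-all" prediction: position i is also penalized against every later
    #    token (a simple average over the remaining targets, an illustrative choice).
    l_all, seq = 0.0, targets.size(1)
    for i in range(seq - 1):
        rest = targets[:, i + 1:]                                   # (batch, seq - i - 1)
        rep = logits[:, i].unsqueeze(1).expand(-1, rest.size(1), -1)
        l_all = l_all + F.cross_entropy(rep.reshape(-1, V), rest.reshape(-1))
    l_all = l_all / (seq - 1)
    return l_rand + l_next + l_all
```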
- Unlocking the Transferability of Tokens in Deep Models for Tabular Data [67.11727608815636]
Fine-tuning a pre-trained deep neural network has become a successful paradigm in various machine learning tasks.
In this paper, we propose TabToken, a method that aims to enhance the quality of feature tokens.
We introduce a contrastive objective that regularizes the tokens, capturing the semantics within and across features.
arXiv Detail & Related papers (2023-10-23T17:53:09Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- Token Dropping for Efficient BERT Pretraining [33.63507016806947]
We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models.
We leverage the already built-in masked language modeling (MLM) loss to identify unimportant tokens with practically no computational overhead.
This simple approach reduces the pretraining cost of BERT by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.
arXiv Detail & Related papers (2022-03-24T17:50:46Z)
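The last entry's idea lends itself to a small sketch: the per-token masked-language-modeling loss that pretraining already computes is reused as an importance score, low-scoring tokens are skipped in part of the network, and they are reinserted afterwards. This is a rough, hypothetical illustration rather than the paper's implementation; the keep ratio, the gather/scatter mechanics, and where in the network the dropping happens are assumptions.

```python
# Rough sketch of "token dropping": reuse the per-token MLM loss as an importance
# score and skip unimportant tokens in some layers, restoring them later.
import torch

def select_important_tokens(per_token_mlm_loss, hidden_states, keep_ratio=0.5):
    """
    per_token_mlm_loss: (batch, seq) running MLM loss per position (importance proxy)
    hidden_states:      (batch, seq, d_model) activations entering the "dropped" layers
    Returns the kept hidden states and the indices needed to scatter them back later.
    """
    batch, seq_len, d_model = hidden_states.shape
    n_keep = max(1, int(seq_len * keep_ratio))
    # Tokens with the highest accumulated MLM loss are treated as important.
    keep_idx = per_token_mlm_loss.topk(n_keep, dim=1).indices           # (batch, n_keep)
    kept = torch.gather(hidden_states, 1,
                        keep_idx.unsqueeze(-1).expand(-1, -1, d_model))
    return kept, keep_idx

def restore_dropped_tokens(kept, keep_idx, full_hidden_states):
    """Scatter processed tokens back; dropped positions keep their earlier activations."""
    d_model = full_hidden_states.size(-1)
    out = full_hidden_states.clone()
    out.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, d_model), kept)
    return out
```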