Related papers: Reasoning Bias of Next Token Prediction Training

Reasoning Bias of Next Token Prediction Training

URL: http://arxiv.org/abs/2502.02007v2
Date: Thu, 20 Feb 2025 03:38:38 GMT
Title: Reasoning Bias of Next Token Prediction Training
Authors: Pengxiao Lin, Zhongwang Zhang, Zhi-Qin John Xu,
Abstract summary: Next token prediction (NTP) is the dominant training paradigm for Large Language Models (LLMs)<n>We show that despite NTP's exposure to noise during training, it surpasses in reasoning ability.<n>We attribute this counterintuitive outcome to the regularizing influence of noise on the training dynamics.
Score: 5.188841610098436
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Since the inception of Large Language Models (LLMs), the quest to efficiently train them for superior reasoning capabilities has been a pivotal challenge. The dominant training paradigm for LLMs is based on next token prediction (NTP). Alternative methodologies, called Critical Token Prediction (CTP), focused exclusively on specific critical tokens (such as the answer in Q\&A dataset), aiming to reduce the overfitting of extraneous information and noise. Contrary to initial assumptions, our research reveals that despite NTP's exposure to noise during training, it surpasses CTP in reasoning ability. We attribute this counterintuitive outcome to the regularizing influence of noise on the training dynamics. Our empirical analysis shows that NTP-trained models exhibit enhanced generalization and robustness across various benchmark reasoning datasets, demonstrating greater resilience to perturbations and achieving flatter loss minima. These findings illuminate that NTP is instrumental in fostering reasoning abilities during pretraining, whereas CTP is more effective for finetuning, thereby enriching our comprehension of optimal training strategies in LLM development.

Related papers

Mitigating Spurious Correlations in LLMs via Causality-Aware Post-Training [57.03005244917803]
Large language models (LLMs) often fail on out-of-distribution (OOD) samples due to spurious correlations acquired during pre-training.<n>Here, we aim to mitigate such spurious correlations through causality-aware post-training (CAPT)<n> Experiments on the formal causal inference benchmark CLadder and the logical reasoning dataset PrOntoQA show that 3B-scale language models fine-tuned with CAPT can outperform both traditional SFT and larger LLMs on in-distribution (ID) and OOD tasks.
arXiv Detail & Related papers (2025-06-11T06:30:28Z)
Pre-Training Curriculum for Multi-Token Prediction in Language Models [2.8071268036220003]
Multi-token prediction (MTP) is a recently proposed pre-training objective for language models.<n>We propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum and a reverse curriculum.
arXiv Detail & Related papers (2025-05-28T18:19:18Z)
Learning Dynamics in Continual Pre-Training for Large Language Models [4.192010912385391]
Continual Pre-Training (CPT) has become a popular method to apply strong foundation models to specific downstream tasks.<n>We focus on how general and downstream domain performance evolves at each training step, with domain performance measured via validation losses.<n>Our formulation presents a comprehensive understanding of several critical factors in CPT, including loss potential, peak learning rate, training steps, replay ratio, etc.
arXiv Detail & Related papers (2025-05-12T17:47:32Z)
An Active Inference perspective on Neurofeedback Training [0.6008132390640294]
Neurofeedback training (NFT) aims to teach self-regulation of brain activity through real-time feedback.<n>NFT suffers from highly variable outcomes and poorly understood mechanisms, hampering its validation.<n>We propose a formal computational model of the NFT closed loop.
arXiv Detail & Related papers (2025-05-06T08:41:31Z)
Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs? [32.04523360747506]
We construct a dataset using 50 1B parameter LLM variants with systematically varied pre-training configurations. We introduce novel unsupervised and supervised proxy metrics derived from pre-training that successfully reduce the relative performance prediction error rate by over 50%.
arXiv Detail & Related papers (2025-04-16T21:19:09Z)
Cognitive-Mental-LLM: Evaluating Reasoning in Large Language Models for Mental Health Prediction via Online Text [0.0]
This study evaluates structured reasoning techniques-Chain-of-Thought (CoT), Self-Consistency (SC-CoT), and Tree-of-Thought (ToT)-to improve classification accuracy across multiple mental health datasets sourced from Reddit. We analyze reasoning-driven prompting strategies, including Zero-shot CoT and Few-shot CoT, using key performance metrics such as Balanced Accuracy, F1 score, and Sensitivity/Specificity. Our findings indicate that reasoning-enhanced techniques improve classification performance over direct prediction, particularly in complex cases.
arXiv Detail & Related papers (2025-03-13T06:42:37Z)
The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency. UPFT removes the need for labeled data or exhaustive sampling. Experiments show that UPFT matches the performance of supervised methods.
arXiv Detail & Related papers (2025-03-04T18:56:03Z)
On multi-token prediction for efficient LLM inference [0.36681882674260474]
We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities. We then explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP.
arXiv Detail & Related papers (2025-02-13T15:42:44Z)
Improve Vision Language Model Chain-of-thought Reasoning [86.83335752119741]
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. We show that training VLM on short answers does not generalize well to reasoning tasks that require more detailed responses.
arXiv Detail & Related papers (2024-10-21T17:00:06Z)
Non-asymptotic Convergence of Training Transformers for Next-token Prediction [48.9399496805422]
Transformers have achieved extraordinary success in modern machine learning due to their excellent ability to handle sequential data. This paper provides a fine-grained non-asymptotic analysis of the training dynamics of a one-layer transformer. We show that the trained transformer presents non-token prediction ability with dataset shift.
arXiv Detail & Related papers (2024-09-25T20:22:06Z)
HiDe-PET: Continual Learning via Hierarchical Decomposition of Parameter-Efficient Tuning [55.88910947643436]
We propose a unified framework for continual learning (CL) with pre-trained models (PTMs) and parameter-efficient tuning (PET) We present Hierarchical Decomposition PET (HiDe-PET), an innovative approach that explicitly optimize the objective through incorporating task-specific and task-shared knowledge. Our approach demonstrates remarkably superior performance over a broad spectrum of recent strong baselines.
arXiv Detail & Related papers (2024-07-07T01:50:25Z)
Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [61.98556945939045]
We propose a framework to learn planning-based reasoning through Direct Preference Optimization (DPO) on collected trajectories. Our results on challenging logical reasoning benchmarks demonstrate the effectiveness of our learning framework.
arXiv Detail & Related papers (2024-02-01T15:18:33Z)
Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality [55.88910947643436]
Self-supervised pre-training is essential for handling vast quantities of unlabeled data in practice. HiDe-Prompt is an innovative approach that explicitly optimize the hierarchical components with an ensemble of task-specific prompts and statistics. Our experiments demonstrate the superior performance of HiDe-Prompt and its robustness to pre-training paradigms in continual learning.
arXiv Detail & Related papers (2023-10-11T06:51:46Z)
Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks. We propose a light-weight black-box tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
Don't Stop Pretraining? Make Prompt-based Fine-tuning Powerful Learner [14.975436239088312]
We re-visit the notion in NLP that continued pre-training improves the performance of fine-tuning (FT) in downstream tasks. We propose Prompt-based Continued Pre-training (PCP), which combines the idea of instruction tuning with conventional continued pre-training. Our empirical evaluations on 21 benchmarks demonstrate that the PCP consistently improves the performance of state-of-the-art prompt-based FT approaches.
arXiv Detail & Related papers (2023-05-02T18:25:30Z)
CARE: Certifiably Robust Learning with Reasoning via Variational Inference [26.210129662748862]
We propose a certifiably robust learning with reasoning pipeline (CARE) CARE achieves significantly higher certified robustness compared with the state-of-the-art baselines. We additionally conducted different ablation studies to demonstrate the empirical robustness of CARE and the effectiveness of different knowledge integration.
arXiv Detail & Related papers (2022-09-12T07:15:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.