Value-Based Pre-Training with Downstream Feedback
- URL: http://arxiv.org/abs/2601.22108v1
- Date: Thu, 29 Jan 2026 18:38:09 GMT
- Title: Value-Based Pre-Training with Downstream Feedback
- Authors: Shuqi Ke, Giulia Fanti
- Abstract summary: V-Pretraining is a value-based, modality-agnostic method for controlled continued pretraining. A lightweight task designer reshapes the pretraining task to maximize the value of each gradient step. Under matched learner update budgets, V-Pretraining of 0.5B--7B language models improves reasoning by up to 18% relative over standard next-token prediction.
- Score: 13.861427289106715
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Can a small amount of verified goal information steer the expensive self-supervised pretraining of foundation models? Standard pretraining optimizes a fixed proxy objective (e.g., next-token prediction), which can misallocate compute away from downstream capabilities of interest. We introduce V-Pretraining: a value-based, modality-agnostic method for controlled continued pretraining in which a lightweight task designer reshapes the pretraining task to maximize the value of each gradient step. For example, consider self-supervised learning (SSL) with sample augmentation. The V-Pretraining task designer selects pretraining tasks (e.g., augmentations) for which the pretraining loss gradient is aligned with a gradient computed over a downstream task (e.g., image segmentation). This helps steer pretraining towards relevant downstream capabilities. Notably, the pretrained model is never updated on downstream task labels; they are used only to shape the pretraining task. Under matched learner update budgets, V-Pretraining of 0.5B--7B language models improves reasoning (GSM8K test Pass@1) by up to 18% relative over standard next-token prediction using only 12% of GSM8K training examples as feedback. In vision SSL, we improve the state-of-the-art results on ADE20K by up to 1.07 mIoU and reduce NYUv2 RMSE while improving ImageNet linear accuracy, and we provide pilot evidence of improved token efficiency in continued pretraining.
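To make the gradient-alignment idea concrete, the following is a minimal PyTorch sketch of a task designer that scores candidate pretraining tasks by how well their pretraining-loss gradients align with a gradient from a small downstream batch. The function names, the cosine-similarity scoring rule, and the toy denoising objective are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the paper's implementation): score candidate pretraining
# tasks by the cosine similarity between their pretraining-loss gradient and a
# gradient computed on a small downstream batch, then update the learner using
# only the pretraining loss of the selected task.
import torch
import torch.nn as nn
import torch.nn.functional as F

def flat_grad(loss, params):
    """Flatten d(loss)/d(params) into a single vector."""
    grads = torch.autograd.grad(loss, params, allow_unused=True)
    return torch.cat([g.reshape(-1) if g is not None else torch.zeros(p.numel())
                      for g, p in zip(grads, params)])

def select_task(model, candidates, pretrain_loss, downstream_loss, downstream_batch):
    """Pick the candidate task whose gradient best aligns with the downstream
    gradient; downstream labels shape the task but never update the model."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_down = flat_grad(downstream_loss(model, downstream_batch), params)
    best_task, best_score = None, -float("inf")
    for task in candidates:                      # e.g., different augmentations
        g_pre = flat_grad(pretrain_loss(model, task), params)
        score = F.cosine_similarity(g_pre, g_down, dim=0).item()
        if score > best_score:
            best_task, best_score = task, score
    return best_task

# Toy usage: a linear model; candidate "augmentations" are noise scales for a
# hypothetical denoising-style SSL proxy objective (illustration only).
torch.manual_seed(0)
model = nn.Linear(8, 8)
x = torch.randn(32, 8)                                    # unlabeled pretraining batch
x_down, y_down = torch.randn(16, 8), torch.randn(16, 8)   # small verified downstream batch

def pretrain_loss(m, noise_scale):
    return ((m(x + noise_scale * torch.randn_like(x)) - x) ** 2).mean()

def downstream_loss(m, batch):
    xb, yb = batch
    return ((m(xb) - yb) ** 2).mean()

chosen = select_task(model, [0.1, 0.5, 1.0], pretrain_loss, downstream_loss, (x_down, y_down))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
optimizer.zero_grad()
pretrain_loss(model, chosen).backward()   # learner update uses the pretraining loss only
optimizer.step()
```

In the paper's setting, the candidates would correspond to reshaped pretraining tasks (e.g., sample augmentations or language-modeling variants), and the downstream gradient would come from a small set of verified examples such as the 12% of GSM8K used as feedback.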
Related papers
- RLP: Reinforcement as a Pretraining Objective [103.45068938532923]
We present an information-driven reinforcement pretraining objective that brings the core spirit of reinforcement learning -- exploration -- to the last phase of pretraining. This training objective essentially encourages the model to think for itself before predicting what comes next, thus teaching an independent thinking behavior earlier in pretraining. Specifically, RLP reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning.
arXiv Detail & Related papers (2025-09-26T17:53:54Z)
- Bootstrapping your behavior: a new pretraining strategy for user behavior sequence data [20.293837889640507]
We introduce Bootstrapping Your Behavior (model), a novel UBS pretraining strategy that predicts an automatically constructed supervision embedding summarizing all behaviors' information within a future time window. Experiments on two real-world industrial datasets and eight downstream tasks demonstrate that the model achieves an average improvement of 3.9% in AUC and 98.9% in training throughput.
arXiv Detail & Related papers (2025-05-22T11:23:38Z)
- Revisiting the Power of Prompt for Visual Tuning [50.11465784194896]
This study explores how the correlation between prompts and patch tokens evolves over the course of training.
Inspired by the observation that the prompt tokens tend to share high mutual information with patch tokens, we propose initializing prompts with downstream token prototypes.
Our method significantly improves adaptation for self-supervised pretraining, achieving task performance gains of at least 10% to 30%.
arXiv Detail & Related papers (2024-02-04T07:49:02Z)
- Pre-Pruning and Gradient-Dropping Improve Differentially Private Image Classification [9.120531252536617]
We introduce a new training paradigm that uses pre-pruning and gradient-dropping to reduce the parameter space and improve scalability.
Our training paradigm introduces a tension between the rates of pre-pruning and gradient-dropping, privacy loss, and classification accuracy.
arXiv Detail & Related papers (2023-06-19T14:35:28Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- A Closer Look at Self-Supervised Lightweight Vision Transformers [44.44888945683147]
Self-supervised learning on large-scale Vision Transformers (ViTs) as pre-training methods has achieved promising downstream performance.
We benchmark several self-supervised pre-training methods on image classification tasks and some downstream dense prediction tasks.
Even vanilla lightweight ViTs show comparable performance to previous SOTA networks with delicate architecture design.
arXiv Detail & Related papers (2022-05-28T14:14:57Z) - Self-Supervised Pre-Training for Transformer-Based Person
Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z) - Wav2vec-S: Semi-Supervised Pre-Training for Speech Recognition [44.347739529374124]
Self-supervised pre-training has dramatically improved the performance of automatic speech recognition (ASR). Most existing self-supervised pre-training approaches are task-agnostic, i.e., they can be applied to various downstream tasks.
We propose a novel pre-training paradigm called wav2vec-S, where we use task-specific semi-supervised pre-training to bridge this gap.
arXiv Detail & Related papers (2021-10-09T07:09:22Z) - Self-Supervised Pretraining Improves Self-Supervised Pretraining [83.1423204498361]
Self-supervised pretraining requires expensive and lengthy computation, large amounts of data, and is sensitive to data augmentation.
This paper explores Hierarchical PreTraining (HPT), which decreases convergence time and improves accuracy by initializing the pretraining process with an existing pretrained model.
We show HPT converges up to 80x faster, improves accuracy across tasks, and improves the robustness of the self-supervised pretraining process to changes in the image augmentation policy or amount of pretraining data.
arXiv Detail & Related papers (2021-03-23T17:37:51Z) - LogME: Practical Assessment of Pre-trained Models for Transfer Learning [80.24059713295165]
The Logarithm of Maximum Evidence (LogME) can be used to assess pre-trained models for transfer learning.
Compared to brute-force fine-tuning, LogME brings over $3000\times$ speedup in wall-clock time.
arXiv Detail & Related papers (2021-02-22T13:58:11Z)
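For reference, below is a minimal NumPy sketch of the evidence-maximization computation behind a LogME-style score: a Bayesian linear head is fit on frozen features via fixed-point updates of the prior and noise precisions, and the resulting per-sample log maximum evidence serves as the transferability score. The fixed-point updates follow standard evidence maximization; the official LogME implementation uses an SVD-based reformulation for efficiency that this sketch omits.

```python
# Minimal sketch of a LogME-style transferability score (see caveats above):
# maximize the evidence of a Bayesian linear model on frozen features and
# report the per-sample log maximum evidence.
import numpy as np

def log_maximum_evidence(features, targets, iters=100, tol=1e-6):
    """Evidence maximization for targets ~ N(features @ w, 1/beta), w ~ N(0, I/alpha)."""
    n, d = features.shape
    ftf = features.T @ features
    fty = features.T @ targets
    alpha, beta = 1.0, 1.0
    for _ in range(iters):
        a = alpha * np.eye(d) + beta * ftf                        # posterior precision
        m = beta * np.linalg.solve(a, fty)                        # posterior mean of w
        residual = float(np.sum((features @ m - targets) ** 2))
        gamma = float(np.trace(beta * np.linalg.solve(a, ftf)))   # well-determined dims
        alpha_new = gamma / max(float(m @ m), 1e-12)
        beta_new = (n - gamma) / max(residual, 1e-12)
        converged = abs(alpha_new - alpha) < tol and abs(beta_new - beta) < tol
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    a = alpha * np.eye(d) + beta * ftf
    m = beta * np.linalg.solve(a, fty)
    residual = float(np.sum((features @ m - targets) ** 2))
    _, logdet = np.linalg.slogdet(a)
    evidence = (n / 2) * np.log(beta) + (d / 2) * np.log(alpha) \
               - (beta / 2) * residual - (alpha / 2) * float(m @ m) \
               - 0.5 * logdet - (n / 2) * np.log(2 * np.pi)
    return evidence / n

def logme_classification(features, labels):
    """Average the one-vs-rest evidence over classes, one common way to score
    classification features with a LogME-style criterion."""
    classes = np.unique(labels)
    one_hot = (labels[:, None] == classes[None, :]).astype(float)
    return float(np.mean([log_maximum_evidence(features, one_hot[:, k])
                          for k in range(len(classes))]))

# Toy usage: score random "pre-trained" features against random labels.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))
labels = rng.integers(0, 5, size=200)
print(logme_classification(feats, labels))
```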