Related papers: $V_0$: A Generalist Value Model for Any Policy at State Zero

URL: http://arxiv.org/abs/2602.03584v1
Date: Tue, 03 Feb 2026 14:35:23 GMT
Title: $V_0$: A Generalist Value Model for Any Policy at State Zero
Authors: Yi-Kai Zhang, Zhiyuan Yao, Hongyan Hao, Yueqing Sun, Qi Gu, Hui Su, Xunliang Cai, De-Chuan Zhan, Han-Jia Ye
Abstract summary: Policy gradient methods rely on a baseline to measure the relative advantage of an action. This baseline is typically estimated by a Value Model (Critic) often as large as the policy model itself. We propose a Generalist Value Model capable of estimating the expected performance of any model on unseen prompts.
Score: 80.7505802128501
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Policy gradient methods rely on a baseline to measure the relative advantage of an action, ensuring the model reinforces behaviors that outperform its current average capability. In the training of Large Language Models (LLMs) using Actor-Critic methods (e.g., PPO), this baseline is typically estimated by a Value Model (Critic) often as large as the policy model itself. However, as the policy continuously evolves, the value model requires expensive, synchronous incremental training to accurately track the shifting capabilities of the policy. To avoid this overhead, Group Relative Policy Optimization (GRPO) eliminates the coupled value model by using the average reward of a group of rollouts as the baseline; yet, this approach necessitates extensive sampling to maintain estimation stability. In this paper, we propose $V_0$, a Generalist Value Model capable of estimating the expected performance of any model on unseen prompts without requiring parameter updates. We reframe value estimation by treating the policy's dynamic capability as an explicit context input; specifically, we leverage a history of instruction-performance pairs to dynamically profile the model, departing from the traditional paradigm that relies on parameter fitting to perceive capability shifts. Focusing on value estimation at State Zero (i.e., the initial prompt, hence $V_0$), our model serves as a critical resource scheduler. During GRPO training, $V_0$ predicts success rates prior to rollout, allowing for efficient sampling budget allocation; during deployment, it functions as a router, dispatching instructions to the most cost-effective and suitable model. Empirical results demonstrate that $V_0$ significantly outperforms heuristic budget allocation and achieves a Pareto-optimal trade-off between performance and cost in LLM routing tasks.
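As a minimal sketch of the group-mean baseline the abstract attributes to GRPO, the Python snippet below subtracts the average reward of a group of rollouts from each rollout's reward. The function name `group_relative_advantages` and the example rewards are illustrative assumptions, not taken from the paper; normalization by the group's standard deviation, which GRPO implementations commonly apply, is omitted for brevity.

```python
from statistics import mean


def group_relative_advantages(rewards):
    """Return each rollout's reward minus the mean reward of its group.

    `rewards` holds the scalar rewards of several rollouts sampled for the
    same prompt. The group mean plays the role of the baseline, so no
    separate value model is needed. (Hypothetical helper for illustration.)
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Example: four rollouts for one prompt, with binary success rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

With only a handful of rollouts the group mean is a noisy estimate of the policy's expected reward on the prompt, which is the sampling cost the abstract points to.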

Related papers

Reinforcement Fine-Tuning of Flow-Matching Policies for Vision-Language-Action Models [7.316631310935769]
Vision-Language-Action (VLA) models have shown strong generalization by leveraging large-scale demonstrations. We propose the Flow Policy Optimization (FPO) algorithm, which reformulates importance sampling by leveraging per-sample changes in the conditional flow-matching objective. We evaluate FPO on the LIBERO benchmark and the ALOHA simulation task against supervised, preference-aligned, diffusion-based, and autoregressive online RL baselines.
arXiv Detail & Related papers (2025-10-11T03:11:18Z)
M3PO: Massively Multi-Task Model-Based Policy Optimization [46.42871544295734]
We introduce Massively Multi-Task Model-Based Policy Optimization (M3PO), a scalable model-based reinforcement learning framework. M3PO addresses sample inefficiency in single-task settings and poor generalization in multi-task domains. M3PO integrates an implicit world model, trained to predict task outcomes without observation reconstruction, with a hybrid exploration strategy.
arXiv Detail & Related papers (2025-06-26T21:39:01Z)
Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance [52.65461207786633]
Policy-based Reinforcement Learning from Human Feedback is essential for aligning large language models with human preferences. It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance. We propose Decoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained global value model (GVM).
arXiv Detail & Related papers (2025-02-24T08:11:33Z)
COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL [50.385005413810084]
Dyna-style model-based reinforcement learning contains two phases: model rollouts to generate samples for policy learning and real environment exploration. COPlanner is a planning-driven framework for model-based methods to address the inaccurately learned dynamics model problem.
arXiv Detail & Related papers (2023-10-11T06:10:07Z)
Structured Dynamic Pricing: Optimal Regret in a Global Shrinkage Model [50.06663781566795]
We consider a dynamic model with the consumers' preferences as well as price sensitivity varying over time. We measure the performance of a dynamic pricing policy via regret, which is the expected revenue loss compared to a clairvoyant that knows the sequence of model parameters in advance. Our regret analysis results not only demonstrate optimality of the proposed policy but also show that for policy planning it is essential to incorporate available structural information.
arXiv Detail & Related papers (2023-03-28T00:23:23Z)
Conservative Bayesian Model-Based Value Expansion for Offline Policy Optimization [41.774837419584735]
Offline reinforcement learning (RL) addresses the problem of learning a performant policy from a fixed batch of data collected by following some behavior policy. Model-based approaches are particularly appealing since they can extract more learning signals from the logged dataset by learning a model of the environment.
arXiv Detail & Related papers (2022-10-07T20:13:50Z)
Bayesian regularization of empirical MDPs [11.3458118258705]
We take a Bayesian perspective and regularize the objective function of the Markov decision process with prior information. We evaluate our proposed algorithms on synthetic simulations and on real-world search logs of a large scale online shopping store.
arXiv Detail & Related papers (2022-08-03T22:02:50Z)
On Effective Scheduling of Model-based Reinforcement Learning [53.027698625496015]
We propose a framework named AutoMBPO to automatically schedule the real data ratio. In this paper, we first theoretically analyze the role of real data in policy training, which suggests that gradually increasing the ratio of real data yields better performance.
arXiv Detail & Related papers (2021-11-16T15:24:59Z)
Evaluating model-based planning and planner amortization for continuous control [79.49319308600228]
We take a hybrid approach, combining model predictive control (MPC) with a learned model and model-free policy learning. We find that well-tuned model-free agents are strong baselines even for high DoF control problems. We show that it is possible to distil a model-based planner into a policy that amortizes the planning without any loss of performance.
arXiv Detail & Related papers (2021-10-07T12:00:40Z)
COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable. We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions. We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
