TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics
- URL: http://arxiv.org/abs/2602.19313v1
- Date: Sun, 22 Feb 2026 19:25:48 GMT
- Title: TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics
- Authors: Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang, Zhongzheng Ren, Lillian J. Ratliff, Jiafei Duan, Dieter Fox, Ranjay Krishna
- Abstract summary: We introduce TOPReward, a novel, probabilistically grounded temporal value function that estimates robotic task progress. In zero-shot evaluations across 130+ distinct real-world tasks, TOPReward achieves 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL. We demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback necessary to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. Unlike prior methods that prompt VLMs to directly output progress values, which are prone to numerical misrepresentation, TOPReward extracts task progress directly from the VLM's internal token logits. In zero-shot evaluations across 130+ distinct real-world tasks and multiple robot platforms (e.g., Franka, YAM, SO-100/101), TOPReward achieves 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL, dramatically outperforming the state-of-the-art GVL baseline, which achieves near-zero correlation on the same open-source model. We further demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.
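To make the core idea concrete (reading progress out of the model's next-token distribution rather than its decoded text), here is a minimal sketch. It is not the paper's implementation: the prompt wording, the 0-9 progress scale, and the expected-value readout are illustrative assumptions, and the Hugging Face LLaVA interface merely stands in for a VLM with logit access.

```python
# Minimal sketch: estimating task progress from a VLM's next-token
# probabilities instead of its decoded text. Illustrative only: the
# prompt wording, the 0-9 progress scale, and the expected-value
# readout are assumptions, not the paper's exact method.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # stand-in: any open VLM with logit access
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def estimate_progress(image, task: str) -> float:
    """Return a progress estimate in [0, 1] for a single frame."""
    prompt = (
        "USER: <image>\n"
        f"Task: {task}. On a scale of 0 (not started) to 9 (complete), "
        "how far along is the task? Answer with a single digit.\n"
        "ASSISTANT:"
    )
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]  # logits for the next token

    # Single digits are single tokens under Llama-family tokenizers;
    # [-1] skips a possible leading space-marker token.
    digit_ids = [
        processor.tokenizer.encode(str(d), add_special_tokens=False)[-1]
        for d in range(10)
    ]
    # Renormalize over the ten candidate answers and take the expected
    # value, rather than trusting a single greedy decode.
    probs = torch.softmax(next_logits[digit_ids], dim=-1)
    return sum(p.item() * d for d, p in zip(range(10), probs)) / 9.0
```

Applied per frame over a trajectory, this yields a dense progress curve; the design choice the abstract emphasizes is to read the probability mass over candidate answers rather than trust a single decoded number.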
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z)
- EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models
Vision-Language-Action (VLA) models have advanced robotic manipulation by leveraging large language models. Supervised Finetuning (SFT) requires hundreds of demonstrations per task, rigidly memorizes trajectories, and fails to adapt when deployment conditions deviate from training. We introduce EVOLVE-VLA, a test-time training framework enabling VLAs to continuously adapt through environment interaction with minimal or zero task-specific demonstrations.
arXiv Detail & Related papers (2025-12-16T18:26:38Z)
- SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
Vision-Language-Action (VLA) models excel in robotic manipulation but are constrained by their heavy reliance on expert demonstrations. Current VLA-RL methods, including group-based optimization approaches, are crippled by severe reward sparsity. We propose Self-Referential Policy Optimization (SRPO), a novel VLA-RL framework.
arXiv Detail & Related papers (2025-11-19T16:52:23Z)
- A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning
We introduce VLAC, a general process reward model built upon InternVL. It outputs a dense progress delta and a done signal, eliminating task-specific reward engineering. VLAC is trained on vision-language datasets to strengthen its perception, dialogue, and reasoning capabilities.
arXiv Detail & Related papers (2025-09-19T12:44:29Z)
- VITA: Zero-Shot Value Functions via Test-Time Adaptation of Vision-Language Models
VITA is a zero-shot value function learning method that enhances both capabilities via test-time adaptation. We demonstrate that VITA's zero-shot value estimates can be utilized for reward shaping in offline reinforcement learning.
arXiv Detail & Related papers (2025-06-11T18:05:33Z)
- Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
We present an optimized fine-tuning recipe for vision-language-action models (VLAs). Our recipe boosts OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26×. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot.
arXiv Detail & Related papers (2025-02-27T00:30:29Z)
- Vision Language Models are In-Context Value Learners
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress. Without any robot- or task-specific training, GVL can predict effective values zero-shot and few-shot, in context, for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z)
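For reference, the Value-Order Correlation (VOC) metric reported by both TOPReward and the GVL entry above measures how well per-frame value predictions recover the true temporal order of an expert trajectory, which is assumed to make monotone progress. Below is a minimal sketch, assuming Spearman rank correlation as the correlation measure and per-frame predictions already in hand (e.g., from estimate_progress above).

```python
# Minimal sketch of Value-Order Correlation (VOC): how well per-frame
# value predictions recover the true temporal order of an expert
# trajectory. Spearman rank correlation is assumed as the measure.
from scipy.stats import spearmanr

def value_order_correlation(predicted_values: list[float]) -> float:
    """predicted_values[i] is the value predicted for the i-th frame
    in true temporal order; a perfect monotone ranking scores 1.0."""
    frame_order = list(range(len(predicted_values)))
    rho, _ = spearmanr(frame_order, predicted_values)
    return float(rho)

# Near-monotone predictions score close to 1; random ones near 0.
print(value_order_correlation([0.05, 0.20, 0.15, 0.50, 0.80, 0.95]))  # ~0.94
```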