Value Vision-Language-Action Planning & Search
- URL: http://arxiv.org/abs/2601.00969v1
- Date: Fri, 02 Jan 2026 19:40:34 GMT
- Title: Value Vision-Language-Action Planning & Search
- Authors: Ali Salamatian, Ke Ren, Kieran Pattison, Cyrus Neary
- Abstract summary: Vision-Language-Action (VLA) models have emerged as powerful generalist policies for robotic manipulation. We introduce Value Vision-Language-Action Planning and Search (V-VLAPS), a framework that augments Monte Carlo Tree Search with a lightweight, learnable value function. We evaluate V-VLAPS on the LIBERO robotic manipulation suite, demonstrating that our value-guided search improves success rates by over 5 percentage points.
- Score: 1.631000263754549
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language-Action (VLA) models have emerged as powerful generalist policies for robotic manipulation, yet they remain fundamentally limited by their reliance on behavior cloning, leading to brittleness under distribution shift. While augmenting pretrained models with test-time search algorithms like Monte Carlo Tree Search (MCTS) can mitigate these failures, existing formulations rely solely on the VLA prior for guidance, lacking a grounded estimate of expected future return. Consequently, when the prior is inaccurate, the planner can only correct action selection via the exploration term, which requires extensive simulation to become effective. To address this limitation, we introduce Value Vision-Language-Action Planning and Search (V-VLAPS), a framework that augments MCTS with a lightweight, learnable value function. By training a simple multilayer perceptron (MLP) on the latent representations of a fixed VLA backbone (Octo), we provide the search with an explicit success signal that biases action selection toward high-value regions. We evaluate V-VLAPS on the LIBERO robotic manipulation suite, demonstrating that our value-guided search improves success rates by over 5 percentage points while reducing the average number of MCTS simulations by 5-15 percent compared to baselines that rely only on the VLA prior.
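The abstract describes augmenting MCTS action selection with an explicit value estimate so the search is not guided by the VLA prior alone. The sketch below is a minimal illustration of that idea using a PUCT-style rule, not the authors' implementation; the function names and the per-child dictionary layout are assumptions.

```python
import math

def puct_score(value, prior, visits, parent_visits, c_puct=1.0):
    # Learned value estimate plus a prior-weighted exploration bonus.
    # Without the value term, only the prior and visit counts guide search.
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + visits)
    return value + exploration

def select_child(children, c_puct=1.0):
    # `children`: list of dicts with the backed-up mean value estimate
    # ("value"), the VLA prior probability ("prior"), and a visit count.
    parent_visits = sum(c["visits"] for c in children) + 1
    scores = [
        puct_score(c["value"], c["prior"], c["visits"], parent_visits, c_puct)
        for c in children
    ]
    return max(range(len(children)), key=scores.__getitem__)
```

With a value term, a child the critic rates highly can win selection even when the policy prior assigns it low probability, which is the failure mode the abstract attributes to prior-only search.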
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z) - Seeing Farther and Smarter: Value-Guided Multi-Path Reflection for VLM Policy Optimization [41.15414881730464]
Vision-Language Models (VLMs) offer a general perceive-reason-act framework. Previous approaches rely on inefficient and often inaccurate implicit learning of state values from noisy foresight predictions. We propose a novel test-time computation framework that decouples state evaluation from action generation.
arXiv Detail & Related papers (2026-02-22T22:53:16Z) - TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics [46.912038830356714]
We introduce TOPReward, a novel, probabilistically grounded temporal value function that estimates robotic task progress. In zero-shot evaluations across 130+ distinct real-world tasks, TOPReward achieves 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL. We demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.
arXiv Detail & Related papers (2026-02-22T19:25:48Z) - Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation [95.89924101984566]
We introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories. LCM injects a learned consistency constraint that enforces temporal coherence and trajectory smoothness.
arXiv Detail & Related papers (2026-02-22T15:39:34Z) - Improving Pre-Trained Vision-Language-Action Policies with Model-Based Search [7.9342097024286815]
We present Vision-Language-Action Planning & Search (VLAPS), which embeds model-based search into the inference procedure of pre-trained VLA policies. VLAPS significantly outperforms VLA-only baselines on language-specified tasks.
arXiv Detail & Related papers (2025-08-17T02:59:42Z) - VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning [14.099306230721245]
We present VLA-RL, an exploration-based framework that improves policies using online-collected data at test time. We fine-tune a pretrained vision-language model as a robotic process reward model, trained on pseudo-reward labels annotated on automatically extracted task segments. VLA-RL enables OpenVLA-7B to surpass the strongest fine-tuned baseline by 4.5% on 40 challenging robotic manipulation tasks in LIBERO.
arXiv Detail & Related papers (2025-05-24T14:42:51Z) - Interactive Post-Training for Vision-Language-Action Models [28.32397816792674]
We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based interactive post-training paradigm. RIPT-VLA fine-tunes pretrained Vision-Language-Action (VLA) models using only sparse binary success rewards. With only one demonstration, RIPT-VLA enables an unworkable SFT model to succeed with a 97% success rate within 15 iterations.
arXiv Detail & Related papers (2025-05-22T17:59:45Z) - CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z) - Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success [100.226572152954]
We present an optimized fine-tuning recipe for vision-language-action models (VLAs). Our recipe boosts OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26×. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot.
arXiv Detail & Related papers (2025-02-27T00:30:29Z) - CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation [100.25567121604382]
Vision-Language-Action (VLA) models have improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios. We present a new advanced VLA architecture derived from Vision-Language Models (VLMs). We show that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds.
arXiv Detail & Related papers (2024-11-29T12:06:03Z) - VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models [66.56298924208319]
Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems. Current assessment methods primarily rely on AI-annotated preference labels from traditional tasks. We introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks.
arXiv Detail & Related papers (2024-11-26T14:08:34Z) - VinePPO: Refining Credit Assignment in RL Training of LLMs [66.80143024475635]
We propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time.
arXiv Detail & Related papers (2024-10-02T15:49:30Z)
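The Monte Carlo estimation idea the VinePPO summary describes can be sketched as follows: instead of training a learned critic, resume several independent completions from each intermediate state and average their returns. The function names and the `rollout_fn` interface below are illustrative, not from the paper.

```python
def mc_state_values(states, rollout_fn, num_samples=8):
    """Unbiased Monte Carlo value estimates: for each state, sample
    `num_samples` fresh completions (via `rollout_fn`, which returns a
    scalar return) and average them. Resetting to an arbitrary state is
    cheap in language environments, which is what makes this practical."""
    return [
        sum(rollout_fn(s) for _ in range(num_samples)) / num_samples
        for s in states
    ]
```

The averaged estimate is unbiased for the true state value but has sampling variance, so `num_samples` trades compute for estimator noise.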
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.