Score the Steps, Not Just the Goal: VLM-Based Subgoal Evaluation for Robotic Manipulation
- URL: http://arxiv.org/abs/2509.19524v1
- Date: Tue, 23 Sep 2025 19:42:14 GMT
- Title: Score the Steps, Not Just the Goal: VLM-Based Subgoal Evaluation for Robotic Manipulation
- Authors: Ramy ElMallah, Krish Chhajer, Chi-Guhn Lee
- Abstract summary: We propose a blueprint for StepEval, a cost-aware plug-in evaluation framework. Our contribution is to outline design principles for a scalable, community-driven open-source project.
- Score: 6.2511886555343805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robot learning papers typically report a single binary success rate (SR), which obscures where a policy succeeds or fails along a multi-step manipulation task. We argue that subgoal-level reporting should become routine: for each trajectory, a vector of per-subgoal SRs that makes partial competence visible (e.g., grasp vs. pour). We propose a blueprint for StepEval, a cost-aware plug-in evaluation framework that utilizes vision-language models (VLMs) as automated judges of subgoal outcomes from recorded images or videos. Rather than proposing new benchmarks or APIs, our contribution is to outline design principles for a scalable, community-driven open-source project. In StepEval, the primary artifact for policy evaluation is the per-subgoal SR vector; however, other quantities (e.g., latency or cost estimates) are also considered for framework-optimization diagnostics to help the community tune evaluation efficiency and accuracy when ground-truth subgoal success labels are available. We discuss how such a framework can remain model-agnostic, support single- or multi-view inputs, and be lightweight enough to adopt across labs. The intended contribution is a shared direction: a minimal, extensible seed that invites open-source contributions, so that scoring the steps, not just the final goal, becomes a standard and reproducible practice.
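The abstract's central artifact is the per-subgoal SR vector: for each trajectory, a judge labels each subgoal as succeeded or failed, and rates are aggregated per subgoal rather than collapsed into one binary number. Since StepEval is described only as a blueprint, the sketch below is hypothetical: the names (`TrajectoryEval`, `score_steps`) and the stubbed judge are illustrative assumptions, with a VLM standing in for the judge in the proposed framework.

```python
from dataclasses import dataclass

@dataclass
class TrajectoryEval:
    subgoals: list[str]      # e.g. ["grasp", "transport", "pour"]
    successes: list[bool]    # per-subgoal outcomes for one rollout,
                             # as labeled by a judge (a VLM in StepEval)

def score_steps(trajectories: list[TrajectoryEval]) -> dict[str, float]:
    """Aggregate per-subgoal success rates across rollouts."""
    totals: dict[str, int] = {}
    wins: dict[str, int] = {}
    for traj in trajectories:
        for name, ok in zip(traj.subgoals, traj.successes):
            totals[name] = totals.get(name, 0) + 1
            wins[name] = wins.get(name, 0) + int(ok)
    return {name: wins[name] / totals[name] for name in totals}

rollouts = [
    TrajectoryEval(["grasp", "pour"], [True, False]),
    TrajectoryEval(["grasp", "pour"], [True, True]),
]
print(score_steps(rollouts))  # {'grasp': 1.0, 'pour': 0.5}
```

A single final-goal SR would report 0.5 for these rollouts; the vector makes visible that grasping always succeeds while pouring fails half the time.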
Related papers
- Subgoal Graph-Augmented Planning for LLM-Guided Open-World Reinforcement Learning [0.0]
Large language models (LLMs) offer strong high-level planning capabilities for reinforcement learning. However, LLMs produce subgoals that are semantically plausible but infeasible or irrelevant in the target environment, and LLM planning conflates generation with self-verification, resulting in overconfident yet unreliable subgoals.
arXiv Detail & Related papers (2025-11-26T02:49:44Z) - Agentic Reinforcement Learning with Implicit Step Rewards [92.26560379363492]
Large language models (LLMs) are increasingly developed as autonomous agents using reinforcement learning (agentic RL). We introduce implicit step rewards for agentic RL (iStar), a general credit-assignment strategy that integrates seamlessly with standard RL algorithms. We evaluate our method on three challenging agent benchmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverifiable rewards in SOTOPIA.
arXiv Detail & Related papers (2025-09-23T16:15:42Z) - SCAN: Structured Capability Assessment and Navigation for LLMs [54.54085382131134]
SCAN (Structured Capability Assessment and Navigation) is a practical framework that enables detailed characterization of Large Language Models. SCAN incorporates four key components: TaxBuilder, which extracts capability-indicating tags from queries to construct a hierarchical taxonomy; RealMix, a query synthesis and filtering mechanism that ensures sufficient evaluation data for each capability tag; and a PC$^2$-based (Pre-Comparison-derived Criteria) LLM-as-a-Judge approach that achieves significantly higher accuracy than the classic LLM-as-a-Judge method.
arXiv Detail & Related papers (2025-05-10T16:52:40Z) - Probabilistic Subgoal Representations for Hierarchical Reinforcement learning [16.756888009396462]
In goal-conditioned hierarchical reinforcement learning, a high-level policy specifies a subgoal for the low-level policy to reach.
Existing methods adopt a subgoal representation that provides a deterministic mapping from state space to latent subgoal space.
This paper employs a GP prior on the latent subgoal space to learn a posterior distribution over the subgoal representation functions.
arXiv Detail & Related papers (2024-06-24T15:09:22Z) - HIQL: Offline Goal-Conditioned RL with Latent States as Actions [81.67963770528753]
We propose a hierarchical algorithm for goal-conditioned RL from offline data.
We show how this hierarchical decomposition makes our method robust to noise in the estimated value function.
Our method can solve long-horizon tasks that stymie prior methods, can scale to high-dimensional image observations, and can readily make use of action-free data.
arXiv Detail & Related papers (2023-07-22T00:17:36Z) - Take a Break in the Middle: Investigating Subgoals towards Hierarchical Script Generation [41.79944184861954]
Goal-oriented Script Generation is a new task of generating a list of steps that can fulfill the given goal.
In this paper, we propose to extend the task from the perspective of cognitive theory.
arXiv Detail & Related papers (2023-05-18T12:10:06Z) - Learning Rational Subgoals from Demonstrations and Instructions [71.86713748450363]
We present a framework for learning useful subgoals that support efficient long-term planning to achieve novel goals.
At the core of our framework is a collection of rational subgoals (RSGs), which are essentially binary classifiers over the environmental states.
Given a goal description, the learned subgoals and the derived dependencies facilitate off-the-shelf planning algorithms, such as A* and RRT.
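Treating a subgoal as a binary classifier over environment states means any off-the-shelf search algorithm can use it directly as a goal test. This paper's RSG machinery is not reproduced here; the following is a minimal illustrative sketch, assuming a toy 5x5 grid world and a hand-written classifier ("reach the right edge") in place of a learned one, plugged into a small A* search.

```python
import heapq
from typing import Callable

State = tuple[int, int]

# A subgoal as a binary classifier over states (hand-written here;
# learned from demonstrations and instructions in the paper).
subgoal: Callable[[State], bool] = lambda s: s[0] == 4

def a_star(start: State, goal_test: Callable[[State], bool]) -> list[State]:
    """A* on a 5x5 grid, using the subgoal classifier as the goal test.
    Heuristic: horizontal distance to the right edge (admissible here)."""
    h = lambda s: 4 - s[0]
    frontier = [(h(start), start, [start])]
    seen: set[State] = set()
    while frontier:
        _, s, path = heapq.heappop(frontier)
        if goal_test(s):
            return path
        if s in seen:
            continue
        seen.add(s)
        x, y = s
        for nxt in [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]:
            if 0 <= nxt[0] < 5 and 0 <= nxt[1] < 5 and nxt not in seen:
                g = len(path)  # uniform step cost
                heapq.heappush(frontier, (g + h(nxt), nxt, path + [nxt]))
    return []

path = a_star((0, 2), subgoal)
print(len(path))  # 5 states: straight line from x=0 to x=4
```

The planner never needs a goal *state*, only the classifier, which is what lets learned subgoals compose with standard planners such as A* or RRT.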
arXiv Detail & Related papers (2023-03-09T18:39:22Z) - Goal Recognition as a Deep Learning Task: the GRNet Approach [0.0]
In automated planning, recognising the goal of an agent from a trace of observations is an important task with many applications.
We study an alternative approach where goal recognition is formulated as a classification task addressed by machine learning.
Our approach, called GRNet, is primarily aimed at making goal recognition more accurate as well as faster by learning how to solve it in a given domain.
arXiv Detail & Related papers (2022-10-05T16:42:48Z) - Goal-Conditioned Q-Learning as Knowledge Distillation [136.79415677706612]
We explore a connection between off-policy reinforcement learning in goal-conditioned settings and knowledge distillation.
We empirically show that this can improve the performance of goal-conditioned off-policy reinforcement learning when the space of goals is high-dimensional.
We also show that this technique can be adapted to allow for efficient learning in the case of multiple simultaneous sparse goals.
arXiv Detail & Related papers (2022-08-28T22:01:10Z) - Adaptive Multi-Goal Exploration [118.40427257364729]
We show how AdaGoal can be used to tackle the objective of learning an $\epsilon$-optimal goal-conditioned policy.
AdaGoal is anchored in the high-level algorithmic structure of existing methods for goal-conditioned deep reinforcement learning.
arXiv Detail & Related papers (2021-11-23T17:59:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.