Related papers: Action-Free Reasoning for Policy Generalization

Action-Free Reasoning for Policy Generalization

URL: http://arxiv.org/abs/2502.03729v2
Date: Tue, 11 Feb 2025 04:51:45 GMT
Title: Action-Free Reasoning for Policy Generalization
Authors: Jaden Clark, Suvir Mirchandani, Dorsa Sadigh, Suneel Belkhale,
Abstract summary: Reasoning through Action-free Data (RAD) learns from both robot demonstration data and action-free human video data.<n>RAD enables effective transfer across the embodiment gap, allowing robots to perform tasks seen only in action-free data.<n>We will release a new dataset of 3,377 human-hand demonstrations with reasoning annotations compatible with the Bridge V2 benchmark.
Score: 23.34099331171177
License: http://creativecommons.org/licenses/by/4.0/
Abstract: End-to-end imitation learning offers a promising approach for training robot policies. However, generalizing to new settings remains a significant challenge. Although large-scale robot demonstration datasets have shown potential for inducing generalization, they are resource-intensive to scale. In contrast, human video data is abundant and diverse, presenting an attractive alternative. Yet, these human-video datasets lack action labels, complicating their use in imitation learning. Existing methods attempt to extract grounded action representations (e.g., hand poses), but resulting policies struggle to bridge the embodiment gap between human and robot actions. We propose an alternative approach: leveraging language-based reasoning from human videos-essential for guiding robot actions-to train generalizable robot policies. Building on recent advances in reasoning-based policy architectures, we introduce Reasoning through Action-free Data (RAD). RAD learns from both robot demonstration data (with reasoning and action labels) and action-free human video data (with only reasoning labels). The robot data teaches the model to map reasoning to low-level actions, while the action-free data enhances reasoning capabilities. Additionally, we will release a new dataset of 3,377 human-hand demonstrations with reasoning annotations compatible with the Bridge V2 benchmark and aimed at facilitating future research on reasoning-driven robot learning. Our experiments show that RAD enables effective transfer across the embodiment gap, allowing robots to perform tasks seen only in action-free data. Furthermore, scaling up action-free reasoning data significantly improves policy performance and generalization to novel tasks. These results highlight the promise of reasoning-driven learning from action-free datasets for advancing generalizable robot control. Project page: https://rad-generalization.github.io

Related papers

Dexterity from Smart Lenses: Multi-Fingered Robot Manipulation with In-the-Wild Human Demonstrations [52.29884993824894]
Learning multi-fingered robot policies from humans performing daily tasks in natural environments has long been a grand goal in the robotics community.<n>AINA enables learning multi-fingered policies from data collected by anyone, anywhere, and in any environment using Aria Gen 2 glasses.
arXiv Detail & Related papers (2025-11-20T18:59:02Z)
From Human Hands to Robot Arms: Manipulation Skills Transfer via Trajectory Alignment [36.08997778717271]
Learning diverse manipulation skills for real-world robots is bottlenecked by reliance on costly and hard-to-scale teleoperated demonstrations.<n>We introduce Traj2Action, a novel framework that bridges this embodiment gap by using the 3D trajectory of the operational endpoint as a unified intermediate representation.<n>Our policy first learns to generate a coarse trajectory, which forms a high-level motion plan by leveraging both human and robot data.
arXiv Detail & Related papers (2025-10-01T04:21:12Z)
AR-VRM: Imitating Human Motions for Visual Robot Manipulation with Analogical Reasoning [5.371855090716962]
Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations.<n>Existing approaches have employed vision-language pretraining with large-scale data.<n>We propose to learn from large-scale human action video datasets in an explicit way.
arXiv Detail & Related papers (2025-08-11T05:09:58Z)
AMPLIFY: Actionless Motion Priors for Robot Learning from Videos [29.799207502031496]
We introduce AMPLIFY, a novel framework that leverages large-scale video data.<n>We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples.<n>In downstream policy learning, our dynamics predictions enable a 1.2-2.2x improvement in low-data regimes, a 1.4x average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks from zero in-distribution action data.
arXiv Detail & Related papers (2025-06-17T05:31:42Z)
VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z)
Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation. Our framework,Track2Act predicts tracks of how points in an image should move in future time-steps based on a goal. We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
arXiv Detail & Related papers (2024-05-02T17:56:55Z)
Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets. We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos. Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-02-22T09:48:47Z)
Learning Video-Conditioned Policies for Unseen Manipulation Tasks [83.2240629060453]
Video-conditioned Policy learning maps human demonstrations of previously unseen tasks to robot manipulation skills. We learn our policy to generate appropriate actions given current scene observations and a video of the target task. We validate our approach on a set of challenging multi-task robot manipulation environments and outperform state of the art.
arXiv Detail & Related papers (2023-05-10T16:25:42Z)
Scaling Robot Learning with Semantically Imagined Experience [21.361979238427722]
Recent advances in robot learning have shown promise in enabling robots to perform manipulation tasks. One of the key contributing factors to this progress is the scale of robot data used to train the models. We propose an alternative route and leverage text-to-image foundation models widely used in computer vision and natural language processing.
arXiv Detail & Related papers (2023-02-22T18:47:51Z)
Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies. The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective.
arXiv Detail & Related papers (2022-11-16T16:26:48Z)
Learning Predictive Models From Observation and Interaction [137.77887825854768]
Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works. However, learning a model that captures the dynamics of complex skills represents a major challenge. We propose a method to augment the training set with observational data of other agents, such as humans.
arXiv Detail & Related papers (2019-12-30T01:10:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.