Related papers: A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards

A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards

URL: http://arxiv.org/abs/2502.08643v2
Date: Tue, 18 Feb 2025 16:45:59 GMT
Title: A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards
Authors: Shivansh Patel, Xinchen Yin, Wenlong Huang, Shubham Garg, Hooshang Nayyeri, Li Fei-Fei, Svetlana Lazebnik, Yunzhu Li,
Abstract summary: We introduce Iterative Keypoint Reward (IKER), a Python-based reward function that serves as a dynamic task specification.<n>We reconstruct real-world scenes in simulation and use the generated rewards to train reinforcement learning policies.<n>The results highlight IKER's effectiveness in enabling robots to perform multi-step tasks in dynamic environments.
Score: 29.923942622540356
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Task specification for robotic manipulation in open-world environments is challenging, requiring flexible and adaptive objectives that align with human intentions and can evolve through iterative feedback. We introduce Iterative Keypoint Reward (IKER), a visually grounded, Python-based reward function that serves as a dynamic task specification. Our framework leverages VLMs to generate and refine these reward functions for multi-step manipulation tasks. Given RGB-D observations and free-form language instructions, we sample keypoints in the scene and generate a reward function conditioned on these keypoints. IKER operates on the spatial relationships between keypoints, leveraging commonsense priors about the desired behaviors, and enabling precise SE(3) control. We reconstruct real-world scenes in simulation and use the generated rewards to train reinforcement learning (RL) policies, which are then deployed into the real world-forming a real-to-sim-to-real loop. Our approach demonstrates notable capabilities across diverse scenarios, including both prehensile and non-prehensile tasks, showcasing multi-step task execution, spontaneous error recovery, and on-the-fly strategy adjustments. The results highlight IKER's effectiveness in enabling robots to perform multi-step tasks in dynamic environments through iterative reward shaping.

Related papers

T-Rex: Task-Adaptive Spatial Representation Extraction for Robotic Manipulation with Vision-Language Models [35.83717913117858]
We introduce T-Rex, a Task-Adaptive Framework for Spatial Representation Extraction.<n>We show that our approach delivers significant advantages in spatial understanding, efficiency, and stability without additional training.
arXiv Detail & Related papers (2025-06-24T10:36:15Z)
Towards Autonomous Reinforcement Learning for Real-World Robotic Manipulation with Large Language Models [5.2364456910271935]
Reinforcement Learning (RL) enables agents to autonomously optimize complex behaviors through interaction and reward signals. In this work, we propose an unsupervised pipeline leveraging GPT-4, a pre-trained LLM, to generate reward functions directly from natural language task descriptions. The rewards are used to train RL agents in simulated environments, where we formalize the reward generation process to enhance feasibility.
arXiv Detail & Related papers (2025-03-06T10:08:44Z)
Subtask-Aware Visual Reward Learning from Segmented Demonstrations [97.80917991633248]
This paper introduces REDS: REward learning from Demonstration with Demonstrations, a novel reward learning framework. We train a dense reward function conditioned on video segments and their corresponding subtasks to ensure alignment with ground-truth reward signals. Our experiments show that REDS significantly outperforms baseline methods on complex robotic manipulation tasks in Meta-World.
arXiv Detail & Related papers (2025-02-28T01:25:37Z)
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model [45.03115608632622]
spatial understanding is the keypoint in robot manipulation.<n>We propose SpatialVLA to explore effective spatial representations for the robot foundation model.<n>We show the proposed Adaptive Action Grids offer a new and effective way to fine-tune the pre-trained SpatialVLA model for new simulation and real-world setups.
arXiv Detail & Related papers (2025-01-27T07:34:33Z)
Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Robot Learning [62.3886343725955]
We introduce Coarse-to-fine Q-Network with Action Sequence (CQN-AS), a novel value-based reinforcement learning algorithm.<n>We study our algorithm on 53 robotic tasks with sparse and dense rewards, as well as with and without demonstrations.
arXiv Detail & Related papers (2024-11-19T01:23:52Z)
Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics. Keypoint-based representations have been proven effective as a succinct representation for essential object capturing features. We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
arXiv Detail & Related papers (2024-10-30T17:37:31Z)
Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
Affordance-Guided Reinforcement Learning via Visual Prompting [51.361977466993345]
Keypoint-based Affordance Guidance for Improvements (KAGI) is a method leveraging rewards shaped by vision-language models (VLMs) for autonomous RL. On real-world manipulation tasks specified by natural language descriptions, KAGI improves the sample efficiency of autonomous RL and enables successful task completion in 20K online fine-tuning steps.
arXiv Detail & Related papers (2024-07-14T21:41:29Z)
Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies [25.760946763103483]
We propose Imagination Policy, a novel multi-task key-frame policy network for solving high-precision pick and place tasks.<n>Instead of learning actions directly, Imagination Policy generates point clouds to imagine desired states which are then translated to actions using rigid action estimation.
arXiv Detail & Related papers (2024-06-17T17:00:41Z)
MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions. Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world. We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks. We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
Gen2Sim: Scaling up Robot Learning in Simulation with Generative Models [17.757495961816783]
Gen2Sim is a method for scaling up robot skill learning in simulation by automating generation of 3D assets, task descriptions, task decompositions and reward functions. Our work contributes hundreds of simulated assets, tasks and demonstrations, taking a step towards fully autonomous robotic manipulation skill acquisition in simulation.
arXiv Detail & Related papers (2023-10-27T17:55:32Z)
Octopus: Embodied Vision-Language Programmer from Environmental Feedback [58.04529328728999]
Embodied vision-language models (VLMs) have achieved substantial progress in multimodal perception and reasoning. To bridge this gap, we introduce Octopus, an embodied vision-language programmer that uses executable code generation as a medium to connect planning and manipulation. Octopus is designed to 1) proficiently comprehend an agent's visual and textual task objectives, 2) formulate intricate action sequences, and 3) generate executable code.
arXiv Detail & Related papers (2023-10-12T17:59:58Z)
End-to-end Reinforcement Learning of Robotic Manipulation with Robust Keypoints Representation [7.374994747693731]
We present an end-to-end Reinforcement Learning framework for robotic manipulation tasks, using a robust and efficient keypoints representation. The proposed method learns keypoints from camera images as the state representation, through a self-supervised autoencoder architecture. We demonstrate the effectiveness of the proposed method on robotic manipulation tasks including grasping and pushing, in different scenarios.
arXiv Detail & Related papers (2022-02-12T09:58:09Z)
Reactive Long Horizon Task Execution via Visual Skill and Precondition Models [59.76233967614774]
We describe an approach for sim-to-real training that can accomplish unseen robotic tasks using models learned in simulation to ground components of a simple task planner. We show an increase in success rate from 91.6% to 98% in simulation and from 10% to 80% success rate in the real-world as compared with naive baselines.
arXiv Detail & Related papers (2020-11-17T15:24:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.