Leveraging Locality to Boost Sample Efficiency in Robotic Manipulation
- URL: http://arxiv.org/abs/2406.10615v1
- Date: Sat, 15 Jun 2024 12:27:35 GMT
- Title: Leveraging Locality to Boost Sample Efficiency in Robotic Manipulation
- Authors: Tong Zhang, Yingdong Hu, Jiacheng You, Yang Gao
- Abstract summary: SGRv2 is an imitation learning framework that enhances sample efficiency through improved visual and action representations.
SGRv2 excels in RLBench tasks with keyframe control using merely 5 demonstrations and surpasses the RVT baseline in 23 of 26 tasks.
- Score: 14.990771038350106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given the high cost of collecting robotic data in the real world, sample efficiency is a consistently compelling pursuit in robotics. In this paper, we introduce SGRv2, an imitation learning framework that enhances sample efficiency through improved visual and action representations. Central to the design of SGRv2 is the incorporation of a critical inductive bias, action locality, which posits that a robot's actions are predominantly influenced by the target object and its interactions with the local environment. Extensive experiments in both simulated and real-world settings demonstrate that action locality is essential for boosting sample efficiency. SGRv2 excels in RLBench tasks with keyframe control using merely 5 demonstrations and surpasses the RVT baseline in 23 of 26 tasks. Furthermore, when evaluated on ManiSkill2 and MimicGen using dense control, SGRv2's success rate is 2.54 times that of SGR. In real-world environments, with only eight demonstrations, SGRv2 can perform a variety of tasks at a markedly higher success rate compared to baseline models. Project website: http://sgrv2-robot.github.io
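The action-locality bias described in the abstract can be illustrated with a minimal sketch: pool per-point scene features with weights that decay with distance from a predicted point of interest, so distant context contributes less to the action prediction. The function name, the Gaussian falloff, and the anchor-point formulation below are illustrative assumptions, not SGRv2's actual architecture.

```python
import numpy as np

def locality_weighted_features(points, features, anchor, sigma=0.1):
    """Pool per-point features with weights that favor points near an
    anchor (e.g. the target object), a simple form of action locality.

    points:   (N, 3) point-cloud coordinates
    features: (N, D) per-point features
    anchor:   (3,) predicted point of interest (hypothetical)
    """
    dists = np.linalg.norm(points - anchor, axis=1)   # (N,) distance to anchor
    weights = np.exp(-dists**2 / (2 * sigma**2))      # Gaussian falloff
    weights /= weights.sum()                          # normalize to sum to 1
    return (weights[:, None] * features).sum(axis=0)  # (D,) pooled feature
```

With a small sigma, a point at the anchor dominates the pooled feature while far-away points are effectively ignored.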
Related papers
- The Art of Imitation: Learning Long-Horizon Manipulation Tasks from Few Demonstrations [13.747258771184372]
There are several open challenges to applying task-parameterized Gaussian mixture models (TP-GMMs) in the wild.
We factorize the robot's end-effector velocity into its direction and magnitude.
We then segment and sequence skills from complex demonstration trajectories.
Our approach enables learning complex manipulation tasks from just five demonstrations.
arXiv Detail & Related papers (2024-07-18T12:01:09Z) - Affordance-Guided Reinforcement Learning via Visual Prompting [51.361977466993345]
We study reward shaping with vision-language models (VLMs) to define dense rewards for robotic learning.
On a real-world manipulation task specified by natural language description, we find that these rewards improve the sample efficiency of autonomous RL.
arXiv Detail & Related papers (2024-07-14T21:41:29Z) - Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations [77.31328397965653]
We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework aimed at surmounting challenges through two key innovations.
A novel agent-agnostic visual representation derived from human manipulation videos, with the specifics of embodiments obscured to enhance generalizability.
An agent-agnostic action representation abstracting a robot's kinematics to a universal agent proxy, emphasizing crucial interactions between end-effector and object.
arXiv Detail & Related papers (2024-04-26T16:40:17Z) - ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation [58.615616224739654]
Conventional robotic manipulation methods usually learn a semantic representation of the observation for prediction.
We propose a dynamic Gaussian Splatting method named ManiGaussian for multi-task robotic manipulation.
Our framework can outperform the state-of-the-art methods by 13.1% in average success rate.
arXiv Detail & Related papers (2024-03-13T08:06:41Z) - RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation [77.41969287400977]
This paper presents RobotScript, a platform for a deployable robot manipulation pipeline powered by code generation.
We also present a benchmark for code generation for robot manipulation tasks specified in free-form natural language.
We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
arXiv Detail & Related papers (2024-02-22T15:12:00Z) - Learning to navigate efficiently and precisely in real environments [14.52507964172957]
Embodied AI literature focuses on end-to-end agents trained in simulators like Habitat or AI-Thor.
In this work we explore end-to-end training of agents in simulation in settings which minimize the sim2real gap.
arXiv Detail & Related papers (2024-01-25T17:50:05Z) - Sample Efficient Robot Learning with Structured World Models [3.1761323820497656]
In game environments, the use of world models has been shown to improve sample efficiency while still achieving good performance.
We compare RGB image observations with a feature space leveraging built-in structure, a common approach in robot skill learning, and evaluate the impact on task performance and learning efficiency with and without a world model.
arXiv Detail & Related papers (2022-10-21T22:08:55Z) - Metric Residual Networks for Sample Efficient Goal-conditioned Reinforcement Learning [52.59242013527014]
Goal-conditioned reinforcement learning (GCRL) has a wide range of potential real-world applications.
Sample efficiency is of utmost importance for GCRL since, by default, the agent is only rewarded when it reaches its goal.
We introduce a novel neural architecture for GCRL that achieves significantly better sample efficiency than the commonly-used monolithic network architecture.
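The sparsity problem noted above is worth making concrete: under one common GCRL convention (an illustrative assumption, not this paper's exact setup), the agent receives a nonzero reward only inside a small tolerance ball around the goal, so almost every transition is uninformative.

```python
import numpy as np

def sparse_goal_reward(state, goal, tol=0.05):
    """Default sparse GCRL reward: 1.0 only when the state is within
    `tol` of the goal, 0.0 otherwise -- the sparsity that makes
    sample efficiency critical."""
    return float(np.linalg.norm(state - goal) <= tol)
```

With such a reward, architectures or relabeling schemes that extract more signal per transition directly translate into sample-efficiency gains.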
arXiv Detail & Related papers (2022-08-17T08:04:41Z) - SAGCI-System: Towards Sample-Efficient, Generalizable, Compositional, and Incremental Robot Learning [41.19148076789516]
We introduce a systematic learning framework called SAGCI-system towards achieving the above four requirements.
Our system first takes the raw point clouds gathered by the camera mounted on the robot's wrist as input and produces an initial model of the surrounding environment, represented as a URDF.
The robot then uses interactive perception to interact with the environment, verifying and modifying the URDF online.
arXiv Detail & Related papers (2021-11-29T16:53:49Z) - Domain Adaptive Robotic Gesture Recognition with Unsupervised Kinematic-Visual Data Alignment [60.31418655784291]
We propose a novel unsupervised domain adaptation framework which can simultaneously transfer multi-modality knowledge, i.e., both kinematic and visual data, from simulator to real robot.
It remedies the domain gap with enhanced transferable features, using temporal cues in videos and inherent cross-modal correlations for gesture recognition.
Results show that our approach recovers performance with substantial gains, up to 12.91% in accuracy and 20.16% in F1 score, without using any real-robot annotations.
arXiv Detail & Related papers (2021-03-06T09:10:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.