DiffGen: Robot Demonstration Generation via Differentiable Physics Simulation, Differentiable Rendering, and Vision-Language Model
- URL: http://arxiv.org/abs/2405.07309v1
- Date: Sun, 12 May 2024 15:38:17 GMT
- Title: DiffGen: Robot Demonstration Generation via Differentiable Physics Simulation, Differentiable Rendering, and Vision-Language Model
- Authors: Yang Jin, Jun Lv, Shuqiang Jiang, Cewu Lu
- Abstract summary: DiffGen is a novel framework that integrates differentiable physics simulation, differentiable rendering, and a vision-language model.
It can generate realistic robot demonstrations by minimizing the distance between the embedding of the language instruction and the embedding of the simulated observation.
Experiments demonstrate that with DiffGen, we could efficiently and effectively generate robot data with minimal human effort or training time.
- Score: 72.66465487508556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating robot demonstrations through simulation is widely recognized as an effective way to scale up robot data. Previous work often trained reinforcement learning agents to generate expert policies, but this approach lacks sample efficiency. Recently, a line of work has attempted to generate robot demonstrations via differentiable simulation, which is promising but heavily relies on reward design, a labor-intensive process. In this paper, we propose DiffGen, a novel framework that integrates differentiable physics simulation, differentiable rendering, and a vision-language model to enable automatic and efficient generation of robot demonstrations. Given a simulated robot manipulation scenario and a natural language instruction, DiffGen can generate realistic robot demonstrations by minimizing the distance between the embedding of the language instruction and the embedding of the simulated observation after manipulation. The embeddings are obtained from the vision-language model, and the optimization is achieved by calculating and descending gradients through the differentiable simulation, differentiable rendering, and vision-language model components, thereby accomplishing the specified task. Experiments demonstrate that with DiffGen, we could efficiently and effectively generate robot data with minimal human effort or training time.
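The core of the approach is a single gradient path that runs from the language-conditioned loss back to the robot actions. The sketch below is a minimal, hypothetical illustration of that loop in PyTorch, not the authors' implementation: a hand-written point-mass "simulator", a Gaussian-blob "renderer", a frozen random linear projection standing in for the vision-language model's image encoder, and a fixed random vector standing in for the instruction embedding. Only the overall structure (optimize actions by descending an embedding-distance loss through rendering and simulation) mirrors the described method.
```python
# Hypothetical sketch of a DiffGen-style optimization loop (toy stand-ins, not the paper's code).
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def simulate(actions, state0):
    """Toy differentiable dynamics: integrate per-step planar actions from an initial state."""
    state = state0
    for a in actions:            # actions: (T, 2) tensor being optimized
        state = state + 0.1 * a  # simple Euler step keeps gradients flowing
    return state

def render(state):
    """Toy differentiable 'renderer': paint the final state as a blob in an 8x8 'image'."""
    grid = torch.linspace(-1.0, 1.0, 8)
    yy, xx = torch.meshgrid(grid, grid, indexing="ij")
    return torch.exp(-((xx - state[0]) ** 2 + (yy - state[1]) ** 2) / 0.1)

# Frozen random projection stands in for the VLM image encoder; a fixed random
# vector stands in for the embedding of the language instruction.
image_encoder = torch.nn.Linear(64, 32).requires_grad_(False)
text_embedding = torch.randn(32)

actions = torch.zeros(20, 2, requires_grad=True)  # action trajectory to optimize
optimizer = torch.optim.Adam([actions], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    final_state = simulate(actions, torch.zeros(2))
    image = render(final_state).flatten()
    obs_embedding = image_encoder(image)
    # Embedding-distance loss, back-propagated through encoder, renderer, and simulator.
    loss = 1.0 - F.cosine_similarity(obs_embedding, text_embedding, dim=0)
    loss.backward()
    optimizer.step()
```
In the actual system the three stand-ins would be replaced by the differentiable physics engine, the differentiable renderer, and a pretrained vision-language model, but the gradient path and the loss take the same form.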
Related papers
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments demonstrate that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z)
- Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks [0.0]
In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation.
We propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%.
Our work thus also sheds light on the potential benefits and limitations of using the current multimodal VAEs for unsupervised learning of robotic motion trajectories.
arXiv Detail & Related papers (2024-04-02T13:25:16Z)
- Learning Quadruped Locomotion Using Differentiable Simulation [31.80380408663424]
Differentiable simulation promises fast convergence and stable training, yet applying it to legged locomotion remains challenging.
This work proposes a new differentiable simulation framework to overcome these challenges.
Our framework enables learning quadruped walking in simulation in minutes without parallelization.
arXiv Detail & Related papers (2024-03-21T22:18:59Z)
- DiffuseBot: Breeding Soft Robots With Physics-Augmented Generative Diffusion Models [102.13968267347553]
We present DiffuseBot, a physics-augmented diffusion model that generates soft robot morphologies capable of excelling in a wide spectrum of tasks.
We showcase a range of simulated and fabricated robots along with their capabilities.
arXiv Detail & Related papers (2023-11-28T18:58:48Z)
- MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations [55.549956643032836]
MimicGen is a system for automatically synthesizing large-scale, rich datasets from only a small number of human demonstrations.
We show that robot agents can be effectively trained on this generated dataset by imitation learning to achieve strong performance in long-horizon and high-precision tasks.
arXiv Detail & Related papers (2023-10-26T17:17:31Z)
- Exploring Visual Pre-training for Robot Manipulation: Datasets, Models and Methods [14.780597545674157]
We investigate the effects of visual pre-training strategies on robot manipulation tasks from three fundamental perspectives.
We propose a visual pre-training scheme for robot manipulation termed Vi-PRoM, which combines self-supervised learning and supervised learning.
arXiv Detail & Related papers (2023-08-07T14:24:52Z)
- Robot Learning with Sensorimotor Pre-training [98.7755895548928]
We present a self-supervised sensorimotor pre-training approach for robotics.
Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens.
We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
arXiv Detail & Related papers (2023-06-16T17:58:10Z)
- Self-Improving Robots: End-to-End Autonomous Visuomotor Reinforcement Learning [54.636562516974884]
In imitation and reinforcement learning, the cost of human supervision limits the amount of data that robots can be trained on.
In this work, we propose MEDAL++, a novel design for self-improving robotic systems.
The robot autonomously practices by learning to both do and undo the task while inferring the reward function from the demonstrations.
arXiv Detail & Related papers (2023-03-02T18:51:38Z)