Related papers: ROPA: Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation

URL: http://arxiv.org/abs/2509.19454v1
Date: Tue, 23 Sep 2025 18:11:53 GMT
Title: ROPA: Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation
Authors: Jason Chen, I-Chun Arthur Liu, Gaurav Sukhatme, Daniel Seita
Abstract summary: We propose Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation (ROPA). ROPA fine-tunes Stable Diffusion to synthesize third-person RGB and RGB-D observations of novel robot poses. Our results across 2625 simulation trials and 300 real-world trials demonstrate that ROPA outperforms baselines and ablations.
Score: 3.1921574296387916
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Training robust bimanual manipulation policies via imitation learning requires demonstration data with broad coverage over robot poses, contacts, and scene contexts. However, collecting diverse and precise real-world demonstrations is costly and time-consuming, which hinders scalability. Prior works have addressed this with data augmentation, typically for either eye-in-hand (wrist camera) setups with RGB inputs or for generating novel images without paired actions, leaving augmentation for eye-to-hand (third-person) RGB-D training with new action labels less explored. In this paper, we propose Synthetic Robot Pose Generation for RGB-D Bimanual Data Augmentation (ROPA), an offline imitation learning data augmentation method that fine-tunes Stable Diffusion to synthesize third-person RGB and RGB-D observations of novel robot poses. Our approach simultaneously generates corresponding joint-space action labels while employing constrained optimization to enforce physical consistency through appropriate gripper-to-object contact constraints in bimanual scenarios. We evaluate our method on 5 simulated and 3 real-world tasks. Our results across 2625 simulation trials and 300 real-world trials demonstrate that ROPA outperforms baselines and ablations, showing its potential for scalable RGB and RGB-D data augmentation in eye-to-hand bimanual manipulation. Our project website is available at: https://ropaaug.github.io/.

Related papers

Real-Time Human-Robot Interaction Intent Detection Using RGB-based Pose and Emotion Cues with Cross-Camera Model Generalization [0.8839687029212673]
Service robots in public spaces require real-time understanding of human behavioral intentions for natural interaction. We present a framework for frame-accurate human-robot interaction intent detection that fuses camera-invariant 2D skeletal pose and facial emotion features extracted from monocular RGB video.
arXiv Detail & Related papers (2025-12-18T08:44:22Z)
R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation [74.41728218960465]
We propose a real-to-real 3D data generation framework (R2RGen) that directly augments the pointcloud observation-action pairs to generate real-world data. R2RGen substantially enhances data efficiency in extensive experiments and demonstrates strong potential for scaling and application to mobile manipulation.
arXiv Detail & Related papers (2025-10-09T17:55:44Z)
MINT-RVAE: Multi-Cues Intention Prediction of Human-Robot Interaction using Human Pose and Emotion Information from RGB-only Camera Data [0.8839687029212673]
We propose a novel pipeline for predicting human interaction intent with frame-level precision. A key challenge in intent prediction is the class imbalance inherent in real-world HRI datasets. Our approach achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-09-26T16:49:40Z)
RoboPearls: Editable Video Simulation for Robot Manipulation [81.18434338506621]
RoboPearls is an editable video simulation framework for robotic manipulation. Built on 3D Gaussian Splatting (3DGS), RoboPearls enables the construction of photo-realistic, view-consistent simulations. We conduct extensive experiments on multiple datasets and scenes, including RLBench, COLOSSEUM, Ego4D, Open X-Embodiment, and a real-world robot.
arXiv Detail & Related papers (2025-06-28T05:03:31Z)
VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z)
SparseGrasp: Robotic Grasping via 3D Semantic Gaussian Splatting from Sparse Multi-View RGB Images [125.66499135980344]
We propose SparseGrasp, a novel open-vocabulary robotic grasping system. SparseGrasp operates efficiently with sparse-view RGB images and handles scene updates quickly. We show that SparseGrasp significantly outperforms state-of-the-art methods in terms of both speed and adaptability.
arXiv Detail & Related papers (2024-12-03T03:56:01Z)
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation [31.211870350260703]
Relational Keypoint Constraints (ReKep) is a visually grounded representation for constraints in robotic manipulation. ReKep is expressed as Python functions mapping a set of 3D keypoints in the environment to a numerical cost. We present system implementations on a wheeled single-arm platform and a stationary dual-arm platform.
arXiv Detail & Related papers (2024-09-03T06:45:22Z)
Render and Diffuse: Aligning Image and Action Spaces for Diffusion-based Behaviour Cloning [15.266994159289645]
We introduce Render and Diffuse (R&D), a method that unifies low-level robot actions and RGB observations within the image space using virtual renders of the 3D model of the robot. This space unification simplifies the learning problem and introduces inductive biases that are crucial for sample efficiency and spatial generalisation. Our results show that R&D exhibits strong spatial generalisation capabilities and is more sample efficient than more common image-to-action methods.
arXiv Detail & Related papers (2024-05-28T14:06:10Z)
NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis [50.93065653283523]
SPARTN (Synthetic Perturbations for Augmenting Robot Trajectories via NeRF) is a fully-offline data augmentation scheme for improving robot policies. Our approach leverages neural radiance fields (NeRFs) to synthetically inject corrective noise into visual demonstrations. In a simulated 6-DoF visual grasping benchmark, SPARTN improves success rates by 2.8$\times$ over imitation learning without the corrective augmentations.
arXiv Detail & Related papers (2023-01-18T23:25:27Z)
A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers [0.0]
We propose a recipe for transferring pretrained ViTs to RGB-D domains for single-view 3D object recognition. We show that our adapted ViTs score up to 95.1% top-1 accuracy on the Washington RGB-D Objects dataset, achieving new state-of-the-art results on this benchmark.
arXiv Detail & Related papers (2022-10-03T12:08:09Z)
Self-Supervised Motion Retargeting with Safety Guarantee [12.325683599398564]
We present a data-driven motion retargeting method that enables the generation of natural motions in humanoid robots from motion capture data or RGB videos. Our method can generate expressive robotic motions from both the CMU motion capture database and YouTube videos.
arXiv Detail & Related papers (2021-03-11T04:17:26Z)
Unseen Object Instance Segmentation for Robotic Environments [67.88276573341734]
We propose a method to segment unseen object instances in tabletop environments. UOIS-Net is comprised of two stages: first, it operates only on depth to produce object instance center votes in 2D or 3D. Surprisingly, our framework is able to learn from synthetic RGB-D data where the RGB is non-photorealistic.
arXiv Detail & Related papers (2020-07-16T01:59:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.