Stillleben: Realistic Scene Synthesis for Deep Learning in Robotics
- URL: http://arxiv.org/abs/2005.05659v1
- Date: Tue, 12 May 2020 10:11:00 GMT
- Title: Stillleben: Realistic Scene Synthesis for Deep Learning in Robotics
- Authors: Max Schwarz and Sven Behnke
- Abstract summary: We describe a synthesis pipeline capable of producing training data for cluttered scene perception tasks.
Our approach arranges object meshes in physically realistic, dense scenes using physics simulation.
Our pipeline can be run online during training of a deep neural network.
- Score: 33.30312206728974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training data is the key ingredient for deep learning approaches, but
difficult to obtain for the specialized domains often encountered in robotics.
We describe a synthesis pipeline capable of producing training data for
cluttered scene perception tasks such as semantic segmentation, object
detection, and correspondence or pose estimation. Our approach arranges object
meshes in physically realistic, dense scenes using physics simulation. The
arranged scenes are rendered using high-quality rasterization with randomized
appearance and material parameters. Noise and other transformations introduced
by the camera sensors are simulated. Our pipeline can be run online during
training of a deep neural network, yielding applications in life-long learning
and in iterative render-and-compare approaches. We demonstrate the usability by
learning semantic segmentation on the challenging YCB-Video dataset without
actually using any training frames, where our method achieves performance
comparable to a conventionally trained model. Additionally, we show successful
application in a real-world regrasping system.
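Since the pipeline can run online during training, the overall control flow lends itself to a short sketch. The following is a minimal, self-contained illustration and not the library's actual API: `arrange_and_render` and `simulate_camera` are stand-ins for the physics-based arrangement, randomized rasterization, and sensor-simulation stages described above, and the one-layer classifier exists only to keep the loop runnable.
```python
import random
import torch
import torch.nn as nn

NUM_CLASSES, H, W = 22, 120, 160  # assumption: 21 YCB-Video objects plus background

def arrange_and_render(num_objects: int):
    """Stand-in for physics-based arrangement plus rasterization. The real
    pipeline drops sampled meshes into a dense, physically plausible scene
    and renders with randomized appearance and material parameters; here we
    just return a random RGB image and per-pixel class IDs (num_objects is
    ignored in this toy version)."""
    rgb = torch.rand(3, H, W)
    labels = torch.randint(0, NUM_CLASSES, (H, W))
    return rgb, labels

def simulate_camera(rgb: torch.Tensor) -> torch.Tensor:
    """Stand-in for the sensor model: additive Gaussian noise only; the paper
    also simulates other transformations introduced by camera sensors."""
    return (rgb + 0.02 * torch.randn_like(rgb)).clamp(0.0, 1.0)

model = nn.Conv2d(3, NUM_CLASSES, kernel_size=1)  # toy per-pixel classifier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    # A fresh cluttered scene is synthesized at every step, so the network
    # never sees the same frame twice; this is what enables the online and
    # life-long learning applications mentioned in the abstract.
    rgb, labels = arrange_and_render(num_objects=random.randint(4, 10))
    logits = model(simulate_camera(rgb).unsqueeze(0))  # (1, C, H, W)
    loss = loss_fn(logits, labels.unsqueeze(0))        # labels: (1, H, W)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```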
Related papers
- URDFormer: A Pipeline for Constructing Articulated Simulation Environments from Real-World Images [39.0780707100513]
We present an integrated end-to-end pipeline that generates simulation scenes complete with articulated kinematic and dynamic structures from real-world images.
In doing so, our work provides both a pipeline for large-scale generation of simulation environments and an integrated system for training robust robotic control policies.
arXiv Detail & Related papers (2024-05-19T20:01:29Z)
- Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z)
- DNS SLAM: Dense Neural Semantic-Informed SLAM [92.39687553022605]
DNS SLAM is a novel neural RGB-D semantic SLAM approach featuring a hybrid representation.
Our method integrates multi-view geometry constraints with image-based feature extraction to improve appearance details.
Our experiments show state-of-the-art tracking performance on both synthetic and real-world data.
arXiv Detail & Related papers (2023-11-30T21:34:44Z)
- Language-guided Robot Grasping: CLIP-based Referring Grasp Synthesis in Clutter [14.489086924126253]
This work focuses on the task of referring grasp synthesis, which predicts a grasp pose for an object referred through natural language in cluttered scenes.
Existing approaches often employ multi-stage pipelines that first segment the referred object and then propose a suitable grasp, and are evaluated in private datasets or simulators that do not capture the complexity of natural indoor scenes.
We propose a novel end-to-end model (CROG) that leverages the visual grounding capabilities of CLIP to learn grasp synthesis directly from image-text pairs (see the sketch below).
arXiv Detail & Related papers (2023-11-09T22:55:10Z)
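The CROG entry above hinges on CLIP's joint image-text embedding. Below is a minimal, hedged sketch of that grounding idea: the CLIP calls (`clip.load`, `clip.tokenize`, `encode_image`, `encode_text`) follow OpenAI's published package, while the fusion step and grasp head are purely hypothetical stand-ins, not CROG's actual end-to-end architecture, which decodes grasps densely from fused features.
```python
import torch
import clip                      # OpenAI CLIP package
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder inputs: a cluttered tabletop image and a referring expression.
image = preprocess(Image.open("clutter.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["the red mug next to the bowl"]).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)  # (1, 512) global visual feature
    txt_feat = model.encode_text(text)    # (1, 512) language feature

# Hypothetical fusion plus grasp head, for illustration only: a real model
# would fuse features densely and decode grasp quality/angle/width maps.
fused = torch.cat([img_feat, txt_feat], dim=-1).float()
grasp_head = torch.nn.Linear(1024, 5)     # (x, y, theta, width, quality)
grasp = grasp_head(fused)
```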
- Leveraging Deepfakes to Close the Domain Gap between Real and Synthetic Images in Facial Capture Pipelines [8.366597450893456]
We propose an end-to-end pipeline for building and tracking 3D facial models from personalized in-the-wild video data.
We present a method for automatic data curation and retrieval based on a hierarchical clustering framework typical of collision algorithms in traditional computer graphics pipelines.
We outline how we train a motion capture regressor, leveraging the aforementioned techniques to avoid the need for real-world ground truth data.
arXiv Detail & Related papers (2022-04-22T15:09:49Z)
- Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system (see the scale-recovery sketch below).
arXiv Detail & Related papers (2022-03-11T01:51:54Z)
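The VRVO summary above rests on a simple geometric fact: virtual stereo pairs come with a known baseline and focal length, so disparities predicted on them carry absolute metric scale. A small worked sketch (the camera numbers are illustrative assumptions, not values from the paper):
```python
import numpy as np

focal_px = 720.0    # assumed focal length of the virtual camera, in pixels
baseline_m = 0.54   # assumed stereo baseline of the virtual rig, in meters

def disparity_to_depth(disparity_px: np.ndarray) -> np.ndarray:
    """Standard stereo relation depth = f * B / d. A disparity network
    trained on such virtual stereo pairs inherits this absolute scale,
    which the direct VO system can then adopt."""
    return focal_px * baseline_m / np.maximum(disparity_px, 1e-6)

disparity = np.array([[38.9, 19.4], [9.7, 4.9]])   # toy disparity map (px)
print(disparity_to_depth(disparity))               # metric depths, roughly 10/20/40/79 m
```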
- Evaluating Continual Learning Algorithms by Generating 3D Virtual Environments [66.83839051693695]
Continual learning refers to the ability of humans and animals to incrementally learn over time in a given environment.
We propose to leverage recent advances in 3D virtual environments in order to approach the automatic generation of potentially life-long dynamic scenes with photo-realistic appearance.
A novel element of this paper is that scenes are described in a parametric way, thus allowing the user to fully control the visual complexity of the input stream the agent perceives (see the parametric sketch below).
arXiv Detail & Related papers (2021-09-16T10:37:21Z)
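To make the parametric scene description above concrete, here is a minimal sketch; every field name and the curriculum schedule are assumptions for illustration, not the paper's actual interface.
```python
from dataclasses import dataclass, field

@dataclass
class SceneParams:
    """Each field is a knob controlling the visual complexity of the
    life-long stream the agent perceives (hypothetical parametrization)."""
    num_objects: int = 5             # scene clutter
    object_speed: float = 0.2        # m/s, pace of dynamic elements
    lighting_variation: float = 0.1  # 0 = static lights, 1 = strong changes
    textures: list = field(default_factory=lambda: ["wood", "metal"])
    episode_length_s: float = 60.0   # duration before the scene regenerates

def curriculum(step: int) -> SceneParams:
    """Gradually raise difficulty over the stream, one way a user might
    exploit full parametric control."""
    level = min(step // 10_000, 5)
    return SceneParams(num_objects=5 + 3 * level,
                       object_speed=0.2 + 0.1 * level,
                       lighting_variation=min(0.1 * (1 + level), 1.0))
```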
- Scene Synthesis via Uncertainty-Driven Attribute Synchronization [52.31834816911887]
This paper introduces a novel neural scene synthesis approach that can capture diverse feature patterns of 3D scenes.
Our method combines the strength of both neural network-based and conventional scene synthesis approaches.
arXiv Detail & Related papers (2021-08-30T19:45:07Z)
- PennSyn2Real: Training Object Recognition Models without Human Labeling [12.923677573437699]
We propose PennSyn2Real, a synthetic dataset consisting of more than 100,000 4K images of more than 20 types of micro aerial vehicles (MAVs).
The dataset can be used to generate arbitrary numbers of training images for high-level computer vision tasks such as MAV detection and classification.
We show that synthetic data generated using this framework can be directly used to train CNN models for common object recognition tasks such as detection and segmentation.
arXiv Detail & Related papers (2020-09-22T02:53:40Z)
- Meta-Sim2: Unsupervised Learning of Scene Structure for Synthetic Data Generation [88.04759848307687]
In Meta-Sim2, we aim to learn the scene structure in addition to parameters, which is a challenging problem due to its discrete nature.
We use Reinforcement Learning to train our model, and design a feature space divergence between our synthesized and target images that is key to successful training (see the sketch below).
We also show that this leads to downstream improvement in the performance of an object detector trained on our generated dataset as opposed to other baseline simulation methods.
arXiv Detail & Related papers (2020-08-20T17:28:45Z)
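The Meta-Sim2 summary above mentions a feature-space divergence between synthesized and target images without specifying it; the sketch below uses RBF-kernel MMD over pretrained ResNet-18 features as one standard instance of such a divergence, an assumption rather than the paper's exact design.
```python
import torch
import torchvision.models as models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # expose the 512-d pooled features
backbone.eval()

def rbf_mmd2(x: torch.Tensor, y: torch.Tensor, sigma: float = 10.0) -> torch.Tensor:
    """Squared maximum mean discrepancy with an RBF kernel: small when the
    two feature distributions match, so it can score how realistic a batch
    of synthesized images looks and drive the RL-trained scene generator."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

with torch.no_grad():
    synth = backbone(torch.rand(8, 3, 224, 224))  # synthesized batch (toy input)
    real = backbone(torch.rand(8, 3, 224, 224))   # target batch (toy input)
    divergence = rbf_mmd2(synth, real)            # scalar training signal
```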
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.