Kubric: A scalable dataset generator
- URL: http://arxiv.org/abs/2203.03570v1
- Date: Mon, 7 Mar 2022 18:13:59 GMT
- Title: Kubric: A scalable dataset generator
- Authors: Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du,
Daniel Duckworth, David J. Fleet, Dan Gnanapragasam, Florian Golemo, Charles
Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti
(Derek) Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz
Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, Mehdi S. M.
Sajjadi, Matan Sela, Vincent Sitzmann, Austin Stone, Deqing Sun, Suhani Vora,
Ziyu Wang, Tianhao Wu, Kwang Moo Yi, Fangcheng Zhong, Andrea Tagliasacchi
- Abstract summary: Kubric is a Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines.
We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation.
- Score: 73.78485189435729
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data is the driving force of machine learning, with the amount and quality of
training data often being more important for the performance of a system than
architecture and training details. But collecting, processing and annotating
real data at scale is difficult, expensive, and frequently raises additional
privacy, fairness and legal concerns. Synthetic data is a powerful tool with
the potential to address these shortcomings: 1) it is cheap, 2) it supports rich
ground-truth annotations, 3) it offers full control over data, and 4) it can circumvent
or mitigate problems regarding bias, privacy and licensing. Unfortunately,
software tools for effective data generation are less mature than those for
architecture design and training, which leads to fragmented generation efforts.
To address these problems we introduce Kubric, an open-source Python framework
that interfaces with PyBullet and Blender to generate photo-realistic scenes,
with rich annotations, and seamlessly scales to large jobs distributed over
thousands of machines and generating TBs of data. We demonstrate the
effectiveness of Kubric by presenting a series of 13 different generated
datasets for tasks ranging from studying 3D NeRF models to optical flow
estimation. We release Kubric, the assets used, all of the generation code, as
well as the rendered datasets for reuse and modification.
Related papers
- Dataset Growth [59.68869191071907]
InfoGrowth is an efficient online algorithm for data cleaning and selection.
It can improve data quality/efficiency on both single-modal and multi-modal tasks.
arXiv Detail & Related papers (2024-05-28T16:43:57Z)
- AI Competitions and Benchmarks: Dataset Development [42.164845505628506]
This chapter provides a comprehensive overview of established methodological tools, enriched by our practical experience.
We develop the tasks involved in dataset development and offer insights into their effective management.
Then, we provide more details about the implementation process which includes data collection, transformation, and quality evaluation.
arXiv Detail & Related papers (2024-04-15T12:01:42Z)
- Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data [87.61900472933523]
This work presents Depth Anything, a highly practical solution for robust monocular depth estimation.
We scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data.
We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos.
arXiv Detail & Related papers (2024-01-19T18:59:52Z)
- Solving Data Quality Problems with Desbordante: a Demo [35.75243108496634]
Desbordante is an open-source data profiler that aims to close this gap.
It is built with emphasis on industrial application: it is efficient, scalable, resilient to crashes, and provides explanations.
In this demonstration, we show several scenarios that allow end users to solve different data quality problems.
arXiv Detail & Related papers (2023-07-27T15:26:26Z)
- Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.724842920942024]
Industries such as finance, meteorology, and energy generate vast amounts of data daily.
We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z)
- A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation [42.2398858786125]
Deep learning in computer vision has achieved great success at the price of large-scale labeled training data.
The uncontrollable data collection process produces non-IID training and test data, where undesired duplication may exist.
To circumvent them, an alternative is to generate synthetic data via 3D rendering with domain randomization.
arXiv Detail & Related papers (2023-03-16T09:03:52Z)
- WorldGen: A Large Scale Generative Simulator [12.886022807173337]
We present WorldGen, an open source framework to autonomously generate countless structured and unstructured 3D photorealistic scenes.
WorldGen gives the user full access and control to features such as texture, object structure, motion, camera and lens properties for better generalizability.
arXiv Detail & Related papers (2022-10-03T05:07:42Z)
- REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named REGRAD to sustain the modeling of relationships among objects and grasps.
Our dataset is collected in both forms of 2D images and 3D point clouds.
Users are free to import their own object models to generate as much data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z)
- UnrealROX+: An Improved Tool for Acquiring Synthetic Data from Virtual 3D Environments [14.453602631430508]
We present an improved version of UnrealROX, a tool to generate synthetic data from robotic images.
UnrealROX+ includes new features such as albedo generation and a Python API for interacting with the virtual environment from Deep Learning frameworks.
arXiv Detail & Related papers (2021-04-23T18:45:42Z)
- Laplacian Denoising Autoencoder [114.21219514831343]
We propose to learn data representations with a novel type of denoising autoencoder.
The noisy input data is generated by corrupting latent clean data in the gradient domain.
Experiments on several visual benchmarks demonstrate that better representations can be learned with the proposed approach.
arXiv Detail & Related papers (2020-03-30T16:52:39Z)