Related papers: Kubric: A scalable dataset generator

Kubric: A scalable dataset generator

URL: http://arxiv.org/abs/2203.03570v1
Date: Mon, 7 Mar 2022 18:13:59 GMT
Title: Kubric: A scalable dataset generator
Authors: Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J. Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti (Derek) Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, Mehdi S. M. Sajjadi, Matan Sela, Vincent Sitzmann, Austin Stone, Deqing Sun, Suhani Vora, Ziyu Wang, Tianhao Wu, Kwang Moo Yi, Fangcheng Zhong, Andrea Tagliasacchi
Abstract summary: Kubric is a Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines. We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation.
Score: 73.78485189435729
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data is the driving force of machine learning, with the amount and quality of training data often being more important for the performance of a system than architecture and training details. But collecting, processing and annotating real data at scale is difficult, expensive, and frequently raises additional privacy, fairness and legal concerns. Synthetic data is a powerful tool with the potential to address these shortcomings: 1) it is cheap 2) supports rich ground-truth annotations 3) offers full control over data and 4) can circumvent or mitigate problems regarding bias, privacy and licensing. Unfortunately, software tools for effective data generation are less mature than those for architecture design and training, which leads to fragmented generation efforts. To address these problems we introduce Kubric, an open-source Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines, and generating TBs of data. We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation. We release Kubric, the used assets, all of the generation code, as well as the rendered datasets for reuse and modification.

Related papers

Anymate: A Dataset and Baselines for Learning 3D Object Rigging [18.973312365787137]
We present a large-scale dataset of 230K 3D assets paired with expert-crafted rigging and skinning information.<n>We propose a learning-based auto-rigging framework with three sequential modules for joint, connectivity, and skinning weight prediction.<n>Our models significantly outperform existing methods, providing a foundation for comparing future methods in automated rigging and skinning.
arXiv Detail & Related papers (2025-05-09T17:59:33Z)
Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for Foundation Models [64.28420991770382]
We present Data-Juicer 2.0, a new system offering fruitful data processing capabilities backed by over a hundred operators. The system is publicly available, actively maintained, and broadly adopted in diverse research endeavors, practical applications, and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z)
Dataset Growth [59.68869191071907]
InfoGrowth is an efficient online algorithm for data cleaning and selection. It can improve data quality/efficiency on both single-modal and multi-modal tasks.
arXiv Detail & Related papers (2024-05-28T16:43:57Z)
AI Competitions and Benchmarks: Dataset Development [42.164845505628506]
This chapter provides a comprehensive overview of established methodological tools, enriched by our practical experience. We develop the tasks involved in dataset development and offer insights into their effective management. Then, we provide more details about the implementation process which includes data collection, transformation, and quality evaluation.
arXiv Detail & Related papers (2024-04-15T12:01:42Z)
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data [87.61900472933523]
This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. We scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data. We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos.
arXiv Detail & Related papers (2024-01-19T18:59:52Z)
Solving Data Quality Problems with Desbordante: a Demo [35.75243108496634]
Desbordante is an open-source data profiler that aims to close this gap. It is built with emphasis on industrial application: it is efficient, scalable, resilient to crashes, and provides explanations. In this demonstration, we show several scenarios that allow end users to solve different data quality problems.
arXiv Detail & Related papers (2023-07-27T15:26:26Z)
Data-Copilot: Bridging Billions of Data and Humans with Autonomous Workflow [49.724842920942024]
Industries such as finance, meteorology, and energy generate vast amounts of data daily. We propose Data-Copilot, a data analysis agent that autonomously performs querying, processing, and visualization of massive data tailored to diverse human requests.
arXiv Detail & Related papers (2023-06-12T16:12:56Z)
A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation [42.2398858786125]
Deep learning in computer vision has achieved great success with the price of large-scale labeled training data. The uncontrollable data collection process produces non-IID training and test data, where undesired duplication may exist. To circumvent them, an alternative is to generate synthetic data via 3D rendering with domain randomization.
arXiv Detail & Related papers (2023-03-16T09:03:52Z)
WorldGen: A Large Scale Generative Simulator [12.886022807173337]
We present WorldGen, an open source framework to autonomously generate countless structured and unstructured 3D photorealistic scenes. WorldGen gives the user full access and control to features such as texture, object structure, motion, camera and lens properties for better generalizability.
arXiv Detail & Related papers (2022-10-03T05:07:42Z)
REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named regrad to sustain the modeling of relationships among objects and grasps. Our dataset is collected in both forms of 2D images and 3D point clouds. Users are free to import their own object models for the generation of as many data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z)
UnrealROX+: An Improved Tool for Acquiring Synthetic Data from Virtual 3D Environments [14.453602631430508]
We present an improved version of UnrealROX, a tool to generate synthetic data from robotic images. Un UnrealROX+ includes new features such as generating albedo or a Python API for interacting with the virtual environment from Deep Learning frameworks.
arXiv Detail & Related papers (2021-04-23T18:45:42Z)
Laplacian Denoising Autoencoder [114.21219514831343]
We propose to learn data representations with a novel type of denoising autoencoder. The noisy input data is generated by corrupting latent clean data in the gradient domain. Experiments on several visual benchmarks demonstrate that better representations can be learned with the proposed approach.
arXiv Detail & Related papers (2020-03-30T16:52:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.