Dataset Factory: A Toolchain For Generative Computer Vision Datasets
- URL: http://arxiv.org/abs/2309.11608v1
- Date: Wed, 20 Sep 2023 19:43:37 GMT
- Title: Dataset Factory: A Toolchain For Generative Computer Vision Datasets
- Authors: Daniel Kharitonov and Ryan Turner
- Abstract summary: We propose a "dataset factory" that separates the storage and processing of samples from metadata.
This enables data-centric operations at scale for machine learning teams and individual researchers.
- Score: 0.9013233848500058
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative AI workflows heavily rely on data-centric tasks - such as
filtering samples by annotation fields, vector distances, or scores produced by
custom classifiers. At the same time, computer vision datasets are quickly
approaching petabyte volumes, rendering data wrangling difficult. In addition,
the iterative nature of data preparation necessitates robust dataset sharing
and versioning mechanisms, both of which are hard to implement ad-hoc. To solve
these challenges, we propose a "dataset factory" approach that separates the
storage and processing of samples from metadata and enables data-centric
operations at scale for machine learning teams and individual researchers.
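The separation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record type `SampleMeta`, the helpers `select` and `version_id`, and the `s3://` URIs are all hypothetical names chosen for illustration. The idea shown is that samples stay in storage and are referenced only by URI, while filtering by annotation field, vector distance, and classifier score runs over a small metadata table.

```python
import hashlib
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SampleMeta:
    """One metadata row; the sample itself stays in storage (hypothetical schema)."""
    uri: str          # pointer to the stored sample, never dereferenced here
    label: str        # annotation field
    embedding: tuple  # feature vector used for distance queries
    score: float      # output of some custom classifier


def cosine_distance(a, b):
    """Return 1 - cosine similarity; treat zero vectors as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0
    return 1.0 - dot / norm


def select(records, label, query, max_distance, min_score):
    """Filter metadata rows by annotation, vector distance, and classifier score."""
    return [
        r for r in records
        if r.label == label
        and cosine_distance(r.embedding, query) <= max_distance
        and r.score >= min_score
    ]


def version_id(records):
    """Derive an order-independent identifier for a dataset version from its URIs."""
    digest = hashlib.sha256()
    for uri in sorted(r.uri for r in records):
        digest.update(uri.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()[:12]


# Hypothetical catalog: four metadata rows pointing at samples in object storage.
catalog = [
    SampleMeta("s3://bucket/img/0001.jpg", "cat", (1.0, 0.0), 0.95),
    SampleMeta("s3://bucket/img/0002.jpg", "cat", (0.0, 1.0), 0.90),
    SampleMeta("s3://bucket/img/0003.jpg", "dog", (1.0, 0.0), 0.99),
    SampleMeta("s3://bucket/img/0004.jpg", "cat", (0.9, 0.1), 0.40),
]

subset = select(catalog, label="cat", query=(1.0, 0.0),
                max_distance=0.2, min_score=0.5)
print([r.uri for r in subset], version_id(subset))
```

Because a derived dataset in this sketch is just a list of URIs plus metadata, sharing or versioning it amounts to storing that list and its identifier rather than copying image files.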
Related papers
- Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach [36.47860223750303]
We consider the problem of automatic curation of high-quality datasets for self-supervised pre-training.
We propose a clustering-based approach for building datasets that satisfy all these criteria.
Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository.
arXiv Detail & Related papers (2024-05-24T14:58:51Z)
- Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
- A Configurable Library for Generating and Manipulating Maze Datasets [0.9268994664916388]
Mazes serve as an excellent testbed due to varied generation algorithms.
We present $\texttt{maze-dataset}$, a comprehensive library for generating, processing, and visualizing datasets consisting of maze-solving tasks.
arXiv Detail & Related papers (2023-09-19T10:20:11Z)
- DataAssist: A Machine Learning Approach to Data Cleaning and Preparation [0.0]
DataAssist is an automated data preparation and cleaning platform that enhances dataset quality using ML-informed methods.
Our tool is applicable to a variety of fields, including economics, business, and forecasting applications, saving over 50% of the time spent on data cleansing and preparation.
arXiv Detail & Related papers (2023-07-14T01:50:53Z)
- Fingerprinting and Building Large Reproducible Datasets [3.2873782624127843]
We propose a tool-supported approach facilitating the creation of large tailored datasets while ensuring their provenance.
We propose a way to define a unique fingerprint to characterize a dataset which, when provided to the extraction process, ensures that the same dataset will be extracted.
arXiv Detail & Related papers (2023-06-20T08:59:33Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- Designing Data: Proactive Data Collection and Iteration for Machine Learning [12.295169687537395]
Lack of diversity in data collection has caused significant failures in machine learning (ML) applications.
New methods to track & manage data collection, iteration, and model training are necessary for evaluating whether datasets reflect real world variability.
arXiv Detail & Related papers (2023-01-24T21:40:29Z)
- Privacy-Preserving Machine Learning for Collaborative Data Sharing via Auto-encoder Latent Space Embeddings [57.45332961252628]
Privacy-preserving machine learning in data-sharing processes is an ever-critical task.
This paper presents an innovative framework that uses Representation Learning via autoencoders to generate privacy-preserving embedded data.
arXiv Detail & Related papers (2022-11-10T17:36:58Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
- REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named REGRAD to sustain the modeling of relationships among objects and grasps.
Our dataset is collected in both forms of 2D images and 3D point clouds.
Users are free to import their own object models to generate as much data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.