AutoDC: Automated data-centric processing
- URL: http://arxiv.org/abs/2111.12548v1
- Date: Tue, 23 Nov 2021 00:48:49 GMT
- Title: AutoDC: Automated data-centric processing
- Authors: Zac Yung-Chun Liu, Shoumik Roychowdhury, Scott Tarlow, Akash Nair,
Shweta Badhe, Tejas Shah
- Abstract summary: We develop an automated data-centric tool (AutoDC) to speed up the dataset improvement processes.
AutoDC is estimated to reduce the manual time spent on data improvement tasks by roughly 80% while improving model accuracy by 10-15%, with the ML code held fixed.
- Score: 0.2936007114555107
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AutoML (automated machine learning) has been extensively developed in the
past few years for the model-centric approach. As for the data-centric
approach, the processes to improve the dataset, such as fixing incorrect
labels, adding examples that represent edge cases, and applying data
augmentation, are still very artisanal and expensive. Here we develop an
automated data-centric tool (AutoDC) that, analogous in purpose to AutoML, aims
to speed up the dataset improvement process. In our preliminary tests on 3
open-source image classification datasets, AutoDC is estimated to reduce the
manual time spent on data improvement tasks by roughly 80% while improving the
model accuracy by 10-15%, with the ML code held fixed.
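One data-centric step the abstract names, fixing incorrect labels, is often bootstrapped by flagging examples whose given label receives low predicted probability under a trained model. The function below is a minimal sketch of that heuristic, not AutoDC's actual algorithm; the threshold and scoring rule are assumptions for illustration.

```python
import numpy as np

def flag_suspect_labels(pred_probs, given_labels, threshold=0.2):
    """Flag examples whose given label has low predicted probability.

    pred_probs   : (n_samples, n_classes) array of model probabilities
    given_labels : (n_samples,) array of integer class labels
    threshold    : labels scoring below this are flagged for review
    """
    # Probability the model assigns to each example's own given label.
    self_confidence = pred_probs[np.arange(len(given_labels)), given_labels]
    # Low self-confidence suggests the label may be wrong.
    suspects = np.where(self_confidence < threshold)[0]
    # Return the most suspicious examples first, for human review.
    return suspects[np.argsort(self_confidence[suspects])]

# Example: the second example is labeled 0, but the model is confident it is 1.
probs = np.array([[0.9, 0.1], [0.05, 0.95], [0.6, 0.4]])
labels = np.array([0, 0, 0])
print(flag_suspect_labels(probs, labels))  # → [1]
```

A reviewer (or an automated relabeling step) would then inspect only the flagged indices, which is where the bulk of the claimed manual-time savings would come from.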
Related papers
- AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios.
We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
arXiv Detail & Related papers (2024-03-26T04:27:56Z) - Automated data processing and feature engineering for deep learning and big data applications: a survey [0.0]
The modern approach to artificial intelligence (AI) aims to design algorithms that learn directly from data.
Not all data processing tasks in conventional deep learning pipelines have been automated.
arXiv Detail & Related papers (2024-03-18T01:07:48Z) - Large Language Models for Automated Data Science: Introducing CAAFE for
Context-Aware Automated Feature Engineering [52.09178018466104]
We introduce Context-Aware Automated Feature Engineering (CAAFE) to generate semantically meaningful features for datasets.
Despite being methodologically simple, CAAFE improves performance on 11 out of 14 datasets.
We highlight the significance of context-aware solutions that can extend the scope of AutoML systems to semantic AutoML.
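CAAFE's accept/reject loop can be described independently of the LLM: propose a candidate feature, keep it only if the held-out validation score improves. The sketch below uses hand-written candidate transforms in place of LLM-generated feature code, and a least-squares R² on a holdout split as a stand-in for cross-validated accuracy; all names and data here are illustrative assumptions, not CAAFE's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=200), rng.normal(size=200)
y = x1 * x2 + 0.1 * rng.normal(size=200)      # target needs an interaction term

def val_score(cols, tr, va):
    """Held-out R^2 of a least-squares fit (stand-in for CV accuracy)."""
    X = np.column_stack(cols)
    w, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
    resid = y[va] - X[va] @ w
    return 1 - resid.var() / y[va].var()

tr, va = np.arange(100), np.arange(100, 200)  # train / validation split
features = [x1, x2]
best = val_score(features, tr, va)

# Candidate features (in CAAFE these would be proposed by an LLM from the
# dataset description); a candidate is kept only if validation score improves.
candidates = {"x1+x2": x1 + x2, "x1*x2": x1 * x2, "x1^2": x1 ** 2}
kept = []
for name, feat in candidates.items():
    s = val_score(features + [feat], tr, va)
    if s > best:                              # accept the feature
        features.append(feat)
        best = s
        kept.append(name)
print(kept)
```

Because the target depends on the x1·x2 interaction, the loop reliably keeps that candidate while rejecting transforms that do not improve the holdout score.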
arXiv Detail & Related papers (2023-05-05T09:58:40Z) - AutoCure: Automated Tabular Data Curation Technique for ML Pipelines [0.0]
We present AutoCure, a novel and configuration-free data curation pipeline.
Unlike traditional data curation methods, AutoCure synthetically enhances the density of the clean data fraction.
In practice, AutoCure can be integrated with open source tools to promote the democratization of machine learning.
arXiv Detail & Related papers (2023-04-26T15:51:47Z) - Automated Machine Learning Techniques for Data Streams [91.3755431537592]
This paper surveys the state-of-the-art open-source AutoML tools, applies them to data collected from streams, and measures how their performance changes over time.
The results show that off-the-shelf AutoML tools can provide satisfactory results but in the presence of concept drift, detection or adaptation techniques have to be applied to maintain the predictive accuracy over time.
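The survey's point about concept drift can be made concrete: a minimal detector compares a model's recent error rate against its lifetime baseline and signals when the gap grows large. This is a toy sketch in the spirit of detectors like DDM, not any specific tool evaluated in the paper; the window size and threshold are assumptions.

```python
from collections import deque

class SimpleDriftDetector:
    """Signal drift when the recent error rate exceeds the lifetime baseline
    rate by more than `tolerance`. A toy stand-in for detectors such as DDM."""

    def __init__(self, window=50, tolerance=0.15):
        self.recent = deque(maxlen=window)  # sliding window of 0/1 errors
        self.errors = 0                     # lifetime error count
        self.n = 0                          # lifetime observation count
        self.tolerance = tolerance

    def update(self, error):
        """Feed one observation (1 = misclassified, 0 = correct).
        Returns True when drift is signalled."""
        self.recent.append(error)
        self.errors += error
        self.n += 1
        baseline = self.errors / self.n
        recent_rate = sum(self.recent) / len(self.recent)
        return recent_rate - baseline > self.tolerance

# Stable stream (~5% errors) followed by a sudden shift (~80% errors).
det = SimpleDriftDetector()
stream = [0] * 95 + [1] * 5 + [1, 1, 0, 1, 1] * 12
drift_at = next((i for i, e in enumerate(stream) if det.update(e)), None)
print(drift_at)  # → 110
```

In an AutoML pipeline, a signal like this would typically trigger retraining or model replacement so that predictive accuracy is maintained over time.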
arXiv Detail & Related papers (2021-06-14T11:42:46Z) - AutoFlow: Learning a Better Training Set for Optical Flow [62.40293188964933]
AutoFlow is a method to render training data for optical flow.
AutoFlow achieves state-of-the-art accuracy in pre-training both PWC-Net and RAFT.
arXiv Detail & Related papers (2021-04-29T17:55:23Z) - AutoDO: Robust AutoAugment for Biased Data with Label Noise via Scalable
Probabilistic Implicit Differentiation [3.118384520557952]
AutoAugment has sparked an interest in automated augmentation methods for deep learning models.
We show that those methods are not robust when applied to biased and noisy data.
We reformulate AutoAugment as a generalized automated dataset optimization (AutoDO) task.
Our experiments show up to 9.3% improvement for biased datasets with label noise compared to prior methods.
arXiv Detail & Related papers (2021-03-10T04:05:33Z) - Adaptive Weighting Scheme for Automatic Time-Series Data Augmentation [79.47771259100674]
We present two sample-adaptive automatic weighting schemes for data augmentation.
We validate our proposed methods on a large, noisy financial dataset and on time-series datasets from the UCR archive.
On the financial dataset, we show that the methods, combined with a trading strategy, lead to improvements in annualized returns of over 50%; on the time-series data we outperform state-of-the-art models on over half of the datasets and achieve similar accuracy on the rest.
arXiv Detail & Related papers (2021-02-16T17:50:51Z) - Fast, Accurate, and Simple Models for Tabular Data via Augmented
Distillation [97.42894942391575]
We propose FAST-DAD to distill arbitrarily complex ensemble predictors into individual models like boosted trees, random forests, and deep networks.
Our individual distilled models are over 10x faster and more accurate than ensemble predictors produced by AutoML tools like H2O/AutoSklearn.
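The distillation idea this entry builds on can be shown in a few lines: instead of training the student on the original hard labels, fit it to the teacher ensemble's (soft) predictions, optionally on extra unlabeled or augmented inputs. Below is a generic least-squares sketch of that pattern, not FAST-DAD itself; the toy linear models and data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy teacher "ensemble": average of several randomly perturbed linear models.
true_w = np.array([2.0, -1.0])
teachers = [true_w + rng.normal(scale=0.1, size=2) for _ in range(5)]

def teacher_predict(X):
    # Ensemble prediction = mean of the member models' outputs.
    return np.mean([X @ w for w in teachers], axis=0)

# Distillation: fit ONE linear student to the teacher's soft predictions,
# using extra unlabeled inputs to densify the training signal (the
# "augmented distillation" idea, in spirit).
X_train = rng.normal(size=(100, 2))
X_extra = rng.normal(size=(400, 2))          # unlabeled / augmented points
X_all = np.vstack([X_train, X_extra])
y_soft = teacher_predict(X_all)              # teacher outputs, not hard labels

student_w, *_ = np.linalg.lstsq(X_all, y_soft, rcond=None)

# The single student now reproduces the 5-model ensemble's predictions.
X_test = rng.normal(size=(50, 2))
err = np.max(np.abs(X_test @ student_w - teacher_predict(X_test)))
print(err < 1e-8)  # → True
```

The speedup claim follows the same logic: one small student is evaluated at inference time instead of every ensemble member.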
arXiv Detail & Related papers (2020-06-25T09:57:47Z) - Adaptation Strategies for Automated Machine Learning on Evolving Data [7.843067454030999]
This study examines the effect of data stream challenges, such as concept drift, on the performance of AutoML methods.
We propose 6 concept drift adaptation strategies and evaluate their effectiveness on different AutoML approaches.
arXiv Detail & Related papers (2020-06-09T14:29:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.