AutoCure: Automated Tabular Data Curation Technique for ML Pipelines
- URL: http://arxiv.org/abs/2304.13636v1
- Date: Wed, 26 Apr 2023 15:51:47 GMT
- Title: AutoCure: Automated Tabular Data Curation Technique for ML Pipelines
- Authors: Mohamed Abdelaal and Rashmi Koparde and Harald Schoening
- Abstract summary: We present AutoCure, a novel and configuration-free data curation pipeline.
Unlike traditional data curation methods, AutoCure synthetically enhances the density of the clean data fraction.
In practice, AutoCure can be integrated with open source tools to promote the democratization of machine learning.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning algorithms have become increasingly prevalent in multiple
domains, such as autonomous driving, healthcare, and finance. In such domains,
data preparation remains a significant challenge in developing accurate models,
requiring substantial expertise and time to navigate the vast space of
well-suited data curation and transformation tools. To address this
challenge, we present AutoCure, a novel and configuration-free data curation
pipeline that improves the quality of tabular data. Unlike traditional data
curation methods, AutoCure synthetically enhances the density of the clean data
fraction through an adaptive ensemble-based error detection method and a data
augmentation module. In practice, AutoCure can be integrated with open source
tools, e.g., Auto-sklearn, H2O, and TPOT, to promote the democratization of
machine learning. As a proof of concept, we provide a comparative evaluation of
AutoCure against 28 combinations of traditional data curation tools,
demonstrating superior performance and predictive accuracy without user
intervention. Our evaluation shows that AutoCure is an effective approach to
automating data preparation and improving the accuracy of machine learning
models.
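The abstract describes AutoCure's two core steps, ensemble-based error detection and augmentation of the clean data fraction, without code. The Python fragment below is a minimal, hypothetical illustration of that overall shape only: the specific detectors, the jitter-based augmentation, and all function names (detect_errors, augment_clean_fraction, curate) are assumptions of this sketch, not AutoCure's actual components.

```python
# Hypothetical sketch of an AutoCure-style curation step placed in front of an AutoML fit.
# The detectors and augmenter below are simple stand-ins, NOT AutoCure's own methods.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

def detect_errors(df: pd.DataFrame, contamination: float = 0.05) -> np.ndarray:
    """Flag likely erroneous rows when two off-the-shelf outlier detectors agree
    (a crude stand-in for an adaptive ensemble-based error detector)."""
    X = df.select_dtypes(include=[np.number]).fillna(df.median(numeric_only=True))
    votes = np.zeros(len(df), dtype=int)
    votes += IsolationForest(contamination=contamination, random_state=0).fit_predict(X) == -1
    votes += LocalOutlierFactor(contamination=contamination).fit_predict(X) == -1
    return votes == 2  # a row counts as "dirty" only if both detectors flag it

def augment_clean_fraction(clean: pd.DataFrame, factor: int = 2) -> pd.DataFrame:
    """Densify the clean fraction by bootstrapping rows and adding small Gaussian
    jitter to numeric columns (a placeholder for the augmentation module)."""
    boot = clean.sample(n=len(clean) * (factor - 1), replace=True, random_state=0)
    num_cols = boot.select_dtypes(include=[np.number]).columns
    boot[num_cols] = boot[num_cols] + np.random.default_rng(0).normal(0.0, 0.01, boot[num_cols].shape)
    return pd.concat([clean, boot], ignore_index=True)

def curate(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows flagged as dirty, then densify what remains."""
    return augment_clean_fraction(df.loc[~detect_errors(df)])

# The curated table can then be handed to any AutoML tool
# (e.g. Auto-sklearn, H2O, TPOT) as an ordinary training DataFrame.
```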
Related papers
- Hardware Aware Ensemble Selection for Balancing Predictive Accuracy and Cost [0.6486052012623046]
We introduce a hardware-aware ensemble selection approach that integrates inference time into post hoc ensembling.
By leveraging an existing framework for ensemble selection with quality diversity optimization, our method evaluates ensemble candidates for their predictive accuracy and hardware efficiency.
Our evaluation using 83 classification datasets shows that our approach sustains competitive accuracy and can significantly improve ensembles' operational efficiency.
arXiv Detail & Related papers (2024-08-05T07:30:18Z)
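The entry above describes scoring ensemble candidates on both predictive accuracy and inference cost during post hoc selection. Below is a minimal, generic sketch of such bi-objective scoring; the latency penalty, top-k selection, and function names are assumptions of this illustration, not the paper's quality-diversity method.

```python
# Illustrative post hoc selection of fitted models by accuracy and measured latency.
# A generic sketch, not the hardware-aware method described in the paper.
import time
import numpy as np
from sklearn.metrics import accuracy_score

def select_ensemble(candidates, X_val, y_val, latency_weight=0.1, k=3):
    """Rank fitted models by validation accuracy minus a penalty on inference time,
    then keep the top-k members."""
    scored = []
    for model in candidates:
        start = time.perf_counter()
        preds = model.predict(X_val)
        latency = time.perf_counter() - start          # crude hardware-cost proxy
        acc = accuracy_score(y_val, preds)
        scored.append((acc - latency_weight * latency, model))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [model for _, model in scored[:k]]

def ensemble_predict(ensemble, X):
    """Majority vote over the selected members (assumes non-negative integer labels)."""
    votes = np.stack([m.predict(X) for m in ensemble])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```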
- Incremental Self-training for Semi-supervised Learning [56.57057576885672]
IST is simple yet effective and fits existing self-training-based semi-supervised learning methods.
We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed.
arXiv Detail & Related papers (2024-04-14T05:02:00Z)
- AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios.
We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
arXiv Detail & Related papers (2024-03-26T04:27:56Z) - Automated data processing and feature engineering for deep learning and big data applications: a survey [0.0]
The modern approach to artificial intelligence (AI) aims to design algorithms that learn directly from data.
However, not all data processing tasks in conventional deep learning pipelines have been automated.
arXiv Detail & Related papers (2024-03-18T01:07:48Z) - Advancing Reacting Flow Simulations with Data-Driven Models [50.9598607067535]
The key to effective use of machine learning tools in multi-physics problems is coupling them to physical and computer models.
The present chapter reviews some of the open opportunities for the application of data-driven reduced-order modeling of combustion systems.
arXiv Detail & Related papers (2022-09-05T16:48:34Z)
- AutoDC: Automated data-centric processing [0.2936007114555107]
We develop an automated data-centric tool (AutoDC) to speed up the dataset improvement processes.
AutoDC is estimated to cut roughly 80% of the manual time spent on data improvement tasks while improving model accuracy by 10-15% with the ML code held fixed.
arXiv Detail & Related papers (2021-11-23T00:48:49Z)
- Self-service Data Classification Using Interactive Visualization and Interpretable Machine Learning [9.13755431537592]
Iterative Visual Logical (IVLC) is an interpretable machine learning algorithm.
IVLC is especially helpful when dealing with sensitive and crucial data like cancer data in the medical domain.
This chapter proposes an automated classification approach combined with a new Coordinate Order (COO) algorithm and a genetic algorithm.
arXiv Detail & Related papers (2021-07-11T05:39:14Z)
- Automated Machine Learning Techniques for Data Streams [91.3755431537592]
This paper surveys the state-of-the-art open-source AutoML tools, applies them to data collected from streams, and measures how their performance changes over time.
The results show that off-the-shelf AutoML tools can provide satisfactory results, but in the presence of concept drift, detection or adaptation techniques have to be applied to maintain predictive accuracy over time.
arXiv Detail & Related papers (2021-06-14T11:42:46Z)
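The finding above, that off-the-shelf AutoML degrades under concept drift unless a detection or adaptation step is added, can be illustrated with a very simple pattern: monitor a rolling error rate and refit on the most recent window when accuracy drops. The window size, threshold, and base learner below are arbitrary assumptions, not a setup prescribed by the survey.

```python
# Generic sliding-window drift handling around a scikit-learn style classifier.
# A simplified illustration, not a specific technique from the surveyed AutoML tools.
from collections import deque
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class DriftAwareClassifier:
    """Wraps a base model, tracks a rolling error rate, and refits on the most
    recent window whenever accuracy falls below a threshold (suspected drift)."""

    def __init__(self, window=200, min_acc=0.7):
        self.model = DecisionTreeClassifier()
        self.min_acc = min_acc
        self.errors = deque(maxlen=window)   # rolling 0/1 mistakes
        self.X_buf = deque(maxlen=window)    # recent feature rows
        self.y_buf = deque(maxlen=window)    # recent labels
        self.fitted = False

    def update(self, x, y):
        """Consume one labeled stream instance: score, buffer, maybe retrain."""
        x = np.asarray(x, dtype=float)
        if self.fitted:
            self.errors.append(int(self.model.predict(x.reshape(1, -1))[0] != y))
        self.X_buf.append(x)
        self.y_buf.append(y)
        rolling_acc = 1.0 - np.mean(self.errors) if self.errors else 1.0
        if not self.fitted or rolling_acc < self.min_acc:
            # Initial fit, or suspected concept drift: retrain on the recent window.
            self.model.fit(np.vstack(self.X_buf), list(self.y_buf))
            self.errors.clear()
            self.fitted = True
```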
- Fast, Accurate, and Simple Models for Tabular Data via Augmented Distillation [97.42894942391575]
We propose FAST-DAD to distill arbitrarily complex ensemble predictors into individual models like boosted trees, random forests, and deep networks.
Our individual distilled models are over 10x faster and more accurate than ensemble predictors produced by AutoML tools like H2O/AutoSklearn.
arXiv Detail & Related papers (2020-06-25T09:57:47Z)
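The distillation idea above, compressing an ensemble teacher into one fast student with the help of augmented data, can be sketched generically as follows. The random-forest teacher, Gaussian-jitter augmentation, and decision-tree student are stand-ins chosen for brevity; they are not FAST-DAD's actual augmentation or model choices.

```python
# Generic teacher-student distillation for tabular data; a hedged illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def distill(X_train, y_train, n_aug=5000, rng_seed=0):
    """Fit an ensemble teacher, create augmented rows by jittering training rows,
    label them with the teacher, and fit a single small student on everything.
    X_train and y_train are assumed to be NumPy arrays."""
    rng = np.random.default_rng(rng_seed)
    teacher = RandomForestClassifier(n_estimators=200, random_state=rng_seed)
    teacher.fit(X_train, y_train)

    # Simple augmentation stand-in: resample rows and add small Gaussian noise.
    idx = rng.integers(0, len(X_train), size=n_aug)
    X_aug = X_train[idx] + rng.normal(0.0, 0.05, size=(n_aug, X_train.shape[1]))
    y_aug = teacher.predict(X_aug)                 # teacher supplies the labels

    student = DecisionTreeClassifier(max_depth=8)  # single, fast model
    student.fit(np.vstack([X_train, X_aug]), np.concatenate([y_train, y_aug]))
    return teacher, student
```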
- Self-Training with Improved Regularization for Sample-Efficient Chest X-Ray Classification [80.00316465793702]
We present a deep learning framework that enables robust modeling in challenging scenarios.
Our results show that using 85% less labeled data, we can build predictive models that match the performance of classifiers trained in a large-scale data setting.
arXiv Detail & Related papers (2020-05-03T02:36:00Z)