AutoCure: Automated Tabular Data Curation Technique for ML Pipelines
- URL: http://arxiv.org/abs/2304.13636v1
- Date: Wed, 26 Apr 2023 15:51:47 GMT
- Title: AutoCure: Automated Tabular Data Curation Technique for ML Pipelines
- Authors: Mohamed Abdelaal and Rashmi Koparde and Harald Schoening
- Abstract summary: We present AutoCure, a novel and configuration-free data curation pipeline.
Unlike traditional data curation methods, AutoCure synthetically enhances the density of the clean data fraction.
In practice, AutoCure can be integrated with open source tools to promote the democratization of machine learning.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning algorithms have become increasingly prevalent in multiple
domains, such as autonomous driving, healthcare, and finance. In such domains,
data preparation remains a significant challenge in developing accurate models,
requiring substantial expertise and time to navigate the vast space of
well-suited data curation and transformation tools. To address this
challenge, we present AutoCure, a novel and configuration-free data curation
pipeline that improves the quality of tabular data. Unlike traditional data
curation methods, AutoCure synthetically enhances the density of the clean data
fraction through an adaptive ensemble-based error detection method and a data
augmentation module. In practice, AutoCure can be integrated with open source
tools, e.g., Auto-sklearn, H2O, and TPOT, to promote the democratization of
machine learning. As a proof of concept, we provide a comparative evaluation of
AutoCure against 28 combinations of traditional data curation tools,
demonstrating superior performance and predictive accuracy without user
intervention. Our evaluation shows that AutoCure is an effective approach to
automating data preparation and improving the accuracy of machine learning
models.
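The abstract describes AutoCure's two core steps, ensemble-based error detection and augmentation of the clean data fraction, without code. The Python fragment below is a minimal, hypothetical illustration of that overall shape only: the specific detectors, the jitter-based augmentation, and all function names (detect_errors, augment_clean_fraction, curate) are assumptions of this sketch, not AutoCure's actual components.

```python
# Hypothetical sketch of an AutoCure-style curation step placed in front of an AutoML fit.
# The detectors and augmenter below are simple stand-ins, NOT AutoCure's own methods.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

def detect_errors(df: pd.DataFrame, contamination: float = 0.05) -> np.ndarray:
    """Flag likely erroneous rows when two off-the-shelf outlier detectors agree
    (a crude stand-in for an adaptive ensemble-based error detector)."""
    X = df.select_dtypes(include=[np.number]).fillna(df.median(numeric_only=True))
    votes = np.zeros(len(df), dtype=int)
    votes += IsolationForest(contamination=contamination, random_state=0).fit_predict(X) == -1
    votes += LocalOutlierFactor(contamination=contamination).fit_predict(X) == -1
    return votes == 2  # a row counts as "dirty" only if both detectors flag it

def augment_clean_fraction(clean: pd.DataFrame, factor: int = 2) -> pd.DataFrame:
    """Densify the clean fraction by bootstrapping rows and adding small Gaussian
    jitter to numeric columns (a placeholder for the augmentation module)."""
    boot = clean.sample(n=len(clean) * (factor - 1), replace=True, random_state=0)
    num_cols = boot.select_dtypes(include=[np.number]).columns
    boot[num_cols] = boot[num_cols] + np.random.default_rng(0).normal(0.0, 0.01, boot[num_cols].shape)
    return pd.concat([clean, boot], ignore_index=True)

def curate(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows flagged as dirty, then densify what remains."""
    return augment_clean_fraction(df.loc[~detect_errors(df)])

# The curated table can then be handed to any AutoML tool
# (e.g. Auto-sklearn, H2O, TPOT) as an ordinary training DataFrame.
```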
Related papers
- Hardware Aware Ensemble Selection for Balancing Predictive Accuracy and Cost [0.6486052012623046]
We introduce a hardware-aware ensemble selection approach that integrates inference time into post hoc ensembling.
By leveraging an existing framework for ensemble selection with quality diversity optimization, our method evaluates ensemble candidates for their predictive accuracy and hardware efficiency.
Our evaluation using 83 classification datasets shows that our approach sustains competitive accuracy and can significantly improve ensembles' operational efficiency.
arXiv Detail & Related papers (2024-08-05T07:30:18Z)
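The entry above describes scoring ensemble candidates on both predictive accuracy and inference cost during post hoc selection. Below is a minimal, generic sketch of such bi-objective scoring; the latency penalty, top-k selection, and function names are assumptions of this illustration, not the paper's quality-diversity method.

```python
# Illustrative post hoc selection of fitted models by accuracy and measured latency.
# A generic sketch, not the hardware-aware method described in the paper.
import time
import numpy as np
from sklearn.metrics import accuracy_score

def select_ensemble(candidates, X_val, y_val, latency_weight=0.1, k=3):
    """Rank fitted models by validation accuracy minus a penalty on inference time,
    then keep the top-k members."""
    scored = []
    for model in candidates:
        start = time.perf_counter()
        preds = model.predict(X_val)
        latency = time.perf_counter() - start          # crude hardware-cost proxy
        acc = accuracy_score(y_val, preds)
        scored.append((acc - latency_weight * latency, model))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [model for _, model in scored[:k]]

def ensemble_predict(ensemble, X):
    """Majority vote over the selected members (assumes non-negative integer labels)."""
    votes = np.stack([m.predict(X) for m in ensemble])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```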
- Incremental Self-training for Semi-supervised Learning [56.57057576885672]
IST is simple yet effective and fits existing self-training-based semi-supervised learning methods.
We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed.
arXiv Detail & Related papers (2024-04-14T05:02:00Z)
- AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios.
We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
arXiv Detail & Related papers (2024-03-26T04:27:56Z) - Automated data processing and feature engineering for deep learning and big data applications: a survey [0.0]
The modern approach to artificial intelligence (AI) aims to design algorithms that learn directly from data.
However, not all data processing tasks in conventional deep learning pipelines have been automated.
arXiv Detail & Related papers (2024-03-18T01:07:48Z) - Advancing Reacting Flow Simulations with Data-Driven Models [50.9598607067535]
The key to effective use of machine learning tools in multi-physics problems is coupling them to physical and computer models.
The present chapter reviews some of the open opportunities for the application of data-driven reduced-order modeling of combustion systems.
arXiv Detail & Related papers (2022-09-05T16:48:34Z)
- AutoDC: Automated data-centric processing [0.2936007114555107]
We develop an automated data-centric tool (AutoDC) to speed up the dataset improvement processes.
AutoDC is estimated to cut roughly 80% of the manual time spent on data improvement tasks while improving model accuracy by 10-15% with the ML code held fixed.
arXiv Detail & Related papers (2021-11-23T00:48:49Z)
- Self-service Data Classification Using Interactive Visualization and Interpretable Machine Learning [9.13755431537592]
Iterative Visual Logical (IVLC) is an interpretable machine learning algorithm.
IVLC is especially helpful when dealing with sensitive and crucial data like cancer data in the medical domain.
This chapter proposes an automated classification approach combined with a new Coordinate Order (COO) algorithm and a genetic algorithm.
arXiv Detail & Related papers (2021-07-11T05:39:14Z)
- Automated Machine Learning Techniques for Data Streams [91.3755431537592]
This paper surveys the state-of-the-art open-source AutoML tools, applies them to data collected from streams, and measures how their performance changes over time.
The results show that off-the-shelf AutoML tools can provide satisfactory results, but in the presence of concept drift, detection or adaptation techniques have to be applied to maintain predictive accuracy over time.
arXiv Detail & Related papers (2021-06-14T11:42:46Z)
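The finding above, that off-the-shelf AutoML degrades under concept drift unless a detection or adaptation step is added, can be illustrated with a very simple pattern: monitor a rolling error rate and refit on the most recent window when accuracy drops. The window size, threshold, and base learner below are arbitrary assumptions, not a setup prescribed by the survey.

```python
# Generic sliding-window drift handling around a scikit-learn style classifier.
# A simplified illustration, not a specific technique from the surveyed AutoML tools.
from collections import deque
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class DriftAwareClassifier:
    """Wraps a base model, tracks a rolling error rate, and refits on the most
    recent window whenever accuracy falls below a threshold (suspected drift)."""

    def __init__(self, window=200, min_acc=0.7):
        self.model = DecisionTreeClassifier()
        self.min_acc = min_acc
        self.errors = deque(maxlen=window)   # rolling 0/1 mistakes
        self.X_buf = deque(maxlen=window)    # recent feature rows
        self.y_buf = deque(maxlen=window)    # recent labels
        self.fitted = False

    def update(self, x, y):
        """Consume one labeled stream instance: score, buffer, maybe retrain."""
        x = np.asarray(x, dtype=float)
        if self.fitted:
            self.errors.append(int(self.model.predict(x.reshape(1, -1))[0] != y))
        self.X_buf.append(x)
        self.y_buf.append(y)
        rolling_acc = 1.0 - np.mean(self.errors) if self.errors else 1.0
        if not self.fitted or rolling_acc < self.min_acc:
            # Initial fit, or suspected concept drift: retrain on the recent window.
            self.model.fit(np.vstack(self.X_buf), list(self.y_buf))
            self.errors.clear()
            self.fitted = True
```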
- Fast, Accurate, and Simple Models for Tabular Data via Augmented Distillation [97.42894942391575]
We propose FAST-DAD to distill arbitrarily complex ensemble predictors into individual models like boosted trees, random forests, and deep networks.
Our individual distilled models are over 10x faster and more accurate than ensemble predictors produced by AutoML tools like H2O/AutoSklearn.
arXiv Detail & Related papers (2020-06-25T09:57:47Z)
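The distillation idea above, compressing an ensemble teacher into one fast student with the help of augmented data, can be sketched generically as follows. The random-forest teacher, Gaussian-jitter augmentation, and decision-tree student are stand-ins chosen for brevity; they are not FAST-DAD's actual augmentation or model choices.

```python
# Generic teacher-student distillation for tabular data; a hedged illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def distill(X_train, y_train, n_aug=5000, rng_seed=0):
    """Fit an ensemble teacher, create augmented rows by jittering training rows,
    label them with the teacher, and fit a single small student on everything.
    X_train and y_train are assumed to be NumPy arrays."""
    rng = np.random.default_rng(rng_seed)
    teacher = RandomForestClassifier(n_estimators=200, random_state=rng_seed)
    teacher.fit(X_train, y_train)

    # Simple augmentation stand-in: resample rows and add small Gaussian noise.
    idx = rng.integers(0, len(X_train), size=n_aug)
    X_aug = X_train[idx] + rng.normal(0.0, 0.05, size=(n_aug, X_train.shape[1]))
    y_aug = teacher.predict(X_aug)                 # teacher supplies the labels

    student = DecisionTreeClassifier(max_depth=8)  # single, fast model
    student.fit(np.vstack([X_train, X_aug]), np.concatenate([y_train, y_aug]))
    return teacher, student
```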
- Self-Training with Improved Regularization for Sample-Efficient Chest X-Ray Classification [80.00316465793702]
We present a deep learning framework that enables robust modeling in challenging scenarios.
Our results show that using 85% less labeled data, we can build predictive models that match the performance of classifiers trained in a large-scale data setting.
arXiv Detail & Related papers (2020-05-03T02:36:00Z)