Augment & Valuate : A Data Enhancement Pipeline for Data-Centric AI
- URL: http://arxiv.org/abs/2112.03837v1
- Date: Tue, 7 Dec 2021 17:22:44 GMT
- Title: Augment & Valuate : A Data Enhancement Pipeline for Data-Centric AI
- Authors: Youngjune Lee, Oh Joon Kwon, Haeju Lee, Joonyoung Kim, Kangwook Lee,
Kee-Eung Kim
- Abstract summary: We propose a data-centric approach to address the fundamental distributional and semantic properties of a dataset with black-box models.
We achieve 84.711% test accuracy (ranked #6, Honorable Mention in the Most Innovative category) in the Data-Centric AI competition using only the provided dataset.
- Score: 19.358073575300004
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data scarcity and noise are important issues in industrial applications of
machine learning. However, it is often challenging to devise a scalable and
generalized approach to address the fundamental distributional and semantic
properties of a dataset with black-box models. For this reason, data-centric
approaches are crucial for automating the machine learning operations pipeline.
To serve as a basis for this automation, we suggest a domain-agnostic pipeline
for refining the quality of data in image classification problems. This
pipeline comprises data valuation, cleansing, and augmentation. With an
appropriate combination of these methods, we achieve 84.711% test accuracy
(ranked #6, Honorable Mention in the Most Innovative category) in the
Data-Centric AI competition using only the provided dataset.
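As a rough illustration of the valuation, cleansing, and augmentation recipe described above, the sketch below scores each training sample with an out-of-fold loss from an off-the-shelf black-box classifier, drops the highest-loss fraction as suspected label noise, and then augments what remains. The scoring heuristic, the drop threshold, and every name here (value_samples, cleanse, augment) are assumptions made for illustration only, not the authors' exact algorithm.

```python
"""Minimal, illustrative data-enhancement sketch: valuation -> cleansing -> augmentation.
This is NOT the paper's exact method; the out-of-fold-loss heuristic and all
function names are assumptions made for demonstration purposes."""
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold


def value_samples(X, y, n_splits=5, seed=0):
    """Score each sample by its out-of-fold log-loss under a black-box model.
    Labels are assumed to be integers 0..K-1. High loss suggests a noisy or
    atypical sample."""
    scores = np.zeros(len(y))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(X, y):
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        proba = clf.predict_proba(X[val_idx])
        # negative log-likelihood of the recorded label
        scores[val_idx] = -np.log(proba[np.arange(len(val_idx)), y[val_idx]] + 1e-12)
    return scores


def cleanse(X, y, scores, drop_frac=0.05):
    """Drop the highest-loss fraction of samples (suspected label noise)."""
    keep = scores <= np.quantile(scores, 1.0 - drop_frac)
    return X[keep], y[keep]


def augment(X, y, noise_std=0.05, copies=1, seed=0):
    """Toy augmentation: jitter retained samples with Gaussian noise.
    For images one would use flips, crops, rotations, etc. instead."""
    rng = np.random.default_rng(seed)
    X_aug = [X] + [X + rng.normal(0.0, noise_std, X.shape) for _ in range(copies)]
    y_aug = [y] * (copies + 1)
    return np.concatenate(X_aug), np.concatenate(y_aug)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # synthetic two-class data with 10% flipped labels to mimic label noise
    X = rng.normal(size=(1000, 16))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    flip = rng.random(len(y)) < 0.10
    y_noisy = np.where(flip, 1 - y, y)

    scores = value_samples(X, y_noisy)
    X_clean, y_clean = cleanse(X, y_noisy, scores, drop_frac=0.10)
    X_final, y_final = augment(X_clean, y_clean, copies=1)
    print(f"kept {len(y_clean)}/{len(y_noisy)} samples, final set: {len(y_final)}")
```

For image classification, the toy Gaussian jitter would be replaced by label-preserving transforms such as flips and crops; the point of the sketch is only the ordering of the three stages.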
Related papers
- Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond [38.89457061559469]
We propose an innovative methodology that automates dataset creation with negligible cost and high efficiency.
We provide open-source software that incorporates existing methods for label error detection and robust learning under noisy and biased data.
We design three benchmark datasets focused on label noise detection, label noise learning, and class-imbalanced learning.
arXiv Detail & Related papers (2024-08-21T04:45:12Z)
- Automated data processing and feature engineering for deep learning and big data applications: a survey [0.0]
The modern approach to artificial intelligence (AI) aims to design algorithms that learn directly from data.
Not all data processing tasks in conventional deep learning pipelines have been automated.
arXiv Detail & Related papers (2024-03-18T01:07:48Z)
- An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z)
- On-the-fly Denoising for Data Augmentation in Natural Language Understanding [101.46848743193358]
We propose an on-the-fly denoising technique for data augmentation that learns from soft augmented labels provided by an organic teacher model trained on the cleaner original data.
Our method can be applied to general augmentation techniques and consistently improve the performance on both text classification and question-answering tasks.
arXiv Detail & Related papers (2022-12-20T18:58:33Z)
- Deep Learning based pipeline for anomaly detection and quality enhancement in industrial binder jetting processes [68.8204255655161]
Anomaly detection describes methods of finding abnormal states, instances or data points that differ from a normal value space.
This paper contributes to a data-centric way of approaching artificial intelligence in industrial production.
arXiv Detail & Related papers (2022-09-21T08:14:34Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that, using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
The conventional experiment-style data science workflow, however, does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor frameworks and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- AutoDO: Robust AutoAugment for Biased Data with Label Noise via Scalable Probabilistic Implicit Differentiation [3.118384520557952]
AutoAugment has sparked an interest in automated augmentation methods for deep learning models.
We show that those methods are not robust when applied to biased and noisy data.
We reformulate AutoAugment as a generalized automated dataset optimization (AutoDO) task.
Our experiments show up to 9.3% improvement for biased datasets with label noise compared to prior methods.
arXiv Detail & Related papers (2021-03-10T04:05:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.