Data Collection and Quality Challenges in Deep Learning: A Data-Centric
AI Perspective
- URL: http://arxiv.org/abs/2112.06409v1
- Date: Mon, 13 Dec 2021 03:57:36 GMT
- Title: Data Collection and Quality Challenges in Deep Learning: A Data-Centric
AI Perspective
- Authors: Steven Euijong Whang, Yuji Roh, Hwanjun Song, Jae-Gil Lee
- Abstract summary: Data-centric AI practices are now becoming mainstream.
Many datasets in the real world are small, dirty, biased, and even poisoned.
For data quality, we study data validation and data cleaning techniques.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Software 2.0 is a fundamental shift in software engineering where machine
learning becomes the new software, powered by big data and computing
infrastructure. As a result, software engineering needs to be re-thought where
data becomes a first-class citizen on par with code. One striking observation
is that 80-90% of the machine learning process is spent on data preparation.
Without good data, even the best machine learning algorithms cannot perform
well. As a result, data-centric AI practices are now becoming mainstream.
Unfortunately, many datasets in the real world are small, dirty, biased, and
even poisoned. In this survey, we study the research landscape for data
collection and data quality primarily for deep learning applications. Data
collection is important because there is less need for feature engineering
for recent deep learning approaches, but instead more need for large amounts of
data. For data quality, we study data validation and data cleaning techniques.
Even if the data cannot be fully cleaned, we can still cope with imperfect data
during model training by using robust model training techniques. In
addition, while bias and fairness have been less studied in traditional data
management research, these issues become essential topics in modern machine
learning applications. We thus study fairness measures and unfairness
mitigation techniques that can be applied before, during, or after model
training. We believe that the data management community is well poised to solve
problems in these directions.
Related papers
- Towards Understanding the Impact of Data Bugs on Deep Learning Models in Software Engineering [13.17302533571231]
Deep learning (DL) systems are prone to bugs from many sources, including training data.
Existing literature suggests that bugs in training data are highly prevalent.
We investigate three types of data prevalent in software engineering tasks: code-based, text-based, and metric-based.
arXiv Detail & Related papers (2024-11-19T00:28:20Z)
- Releasing Malevolence from Benevolence: The Menace of Benign Data on Machine Unlearning [28.35038726318893]
Machine learning models trained on vast amounts of real or synthetic data often achieve outstanding predictive performance across various domains.
To address privacy concerns, machine unlearning has been proposed to erase specific data samples from models.
We introduce the Unlearning Usability Attack to distill data distribution information into a small set of benign data.
arXiv Detail & Related papers (2024-07-06T15:42:28Z)
- The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
- How to Do Machine Learning with Small Data? -- A Review from an Industrial Perspective [1.443696537295348]
The authors focus on interpreting the general term "small data" and its role in engineering and industrial applications.
Small data is defined in terms of various characteristics compared to big data, and a machine learning formalism was introduced.
Five critical challenges of machine learning with small data in industrial applications are presented.
arXiv Detail & Related papers (2023-11-13T07:39:13Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Data Budgeting for Machine Learning [17.524791147624086]
We study the data budgeting problem and formulate it as two sub-problems.
We propose a learning method to solve data budgeting problems.
Our empirical evaluation shows that it is possible to perform data budgeting given a small pilot study dataset with as few as 50 data points.
arXiv Detail & Related papers (2022-10-03T14:53:17Z)
- A Survey of Machine Unlearning [56.017968863854186]
Recent regulations now require that, on request, private information about a user must be removed from computer systems.
ML models often 'remember' the old data.
Recent works on machine unlearning have not been able to completely solve the problem.
arXiv Detail & Related papers (2022-09-06T08:51:53Z)
- Kubric: A scalable dataset generator [73.78485189435729]
Kubric is a Python framework that interfaces with PyBullet and Blender to generate photo-realistic scenes, with rich annotations, and seamlessly scales to large jobs distributed over thousands of machines.
We demonstrate the effectiveness of Kubric by presenting a series of 13 different generated datasets for tasks ranging from studying 3D NeRF models to optical flow estimation.
arXiv Detail & Related papers (2022-03-07T18:13:59Z)
- Understanding the World Through Action [91.3755431537592]
I will argue that a general, principled, and powerful framework for utilizing unlabeled data can be derived from reinforcement learning.
I will discuss how such a procedure is more closely aligned with potential downstream tasks.
arXiv Detail & Related papers (2021-10-24T22:33:52Z)
- Data science on industrial data -- Today's challenges in brown field applications [0.0]
This paper shows the state of the art and what to expect when working with stock machines in the field.
A major focus in this paper is on data collection which can be more cumbersome than most people might expect.
Data quality for machine learning applications is a challenge once leaving the laboratory.
arXiv Detail & Related papers (2020-06-10T10:05:16Z)
- DeGAN : Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and lack of relevant data, for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)