Automatic Feasibility Study via Data Quality Analysis for ML: A
Case-Study on Label Noise
- URL: http://arxiv.org/abs/2010.08410v4
- Date: Tue, 30 Aug 2022 12:14:18 GMT
- Title: Automatic Feasibility Study via Data Quality Analysis for ML: A
Case-Study on Label Noise
- Authors: Cedric Renggli, Luka Rimanic, Luka Kolar, Wentao Wu, Ce Zhang
- Abstract summary: We present Snoopy, with the goal of supporting data scientists and machine learning engineers performing a systematic and theoretically founded feasibility study.
We approach this problem by estimating the irreducible error of the underlying task, also known as the Bayes error rate (BER)
We demonstrate in end-to-end experiments how users are able to save substantial labeling time and monetary efforts.
- Score: 21.491392581672198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In our experience of working with domain experts who are using today's AutoML
systems, a common problem we encountered is what we call "unrealistic
expectations" -- when users are facing a very challenging task with a noisy
data acquisition process, while being expected to achieve startlingly high
accuracy with machine learning (ML). Many of these are predestined to fail from
the beginning. In traditional software engineering, this problem is addressed
via a feasibility study, an indispensable step before developing any software
system. In this paper, we present Snoopy, with the goal of supporting data
scientists and machine learning engineers performing a systematic and
theoretically founded feasibility study before building ML applications. We
approach this problem by estimating the irreducible error of the underlying
task, also known as the Bayes error rate (BER), which stems from data quality
issues in datasets used to train or evaluate ML model artifacts. We design a
practical Bayes error estimator that is compared against baseline feasibility
study candidates on 6 datasets (with additional real and synthetic noise of
different levels) in computer vision and natural language processing.
Furthermore, by including our systematic feasibility study with additional
signals into the iterative label cleaning process, we demonstrate in end-to-end
experiments how users are able to save substantial labeling time and monetary
efforts.
Related papers
- Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - Zero-knowledge Proof Meets Machine Learning in Verifiability: A Survey [19.70499936572449]
High-quality models rely not only on efficient optimization algorithms but also on the training and learning processes built upon vast amounts of data and computational power.
Due to various challenges such as limited computational resources and data privacy concerns, users in need of models often cannot train machine learning models locally.
This paper presents a comprehensive survey of zero-knowledge proof-based verifiable machine learning (ZKP-VML) technology.
arXiv Detail & Related papers (2023-10-23T12:15:23Z) - Benchmarking Automated Machine Learning Methods for Price Forecasting
Applications [58.720142291102135]
We show the possibility of substituting manually created ML pipelines with automated machine learning (AutoML) solutions.
Based on the CRISP-DM process, we split the manual ML pipeline into a machine learning and non-machine learning part.
We show in a case study for the industrial use case of price forecasting, that domain knowledge combined with AutoML can weaken the dependence on ML experts.
arXiv Detail & Related papers (2023-04-28T10:27:38Z) - Deep Transfer Learning for Automatic Speech Recognition: Towards Better
Generalization [3.6393183544320236]
Speech recognition has become an important challenge when using deep learning (DL)
It requires large-scale training datasets and high computational and storage resources.
Deep transfer learning (DTL) has been introduced to overcome these issues.
arXiv Detail & Related papers (2023-04-27T21:08:05Z) - Privacy Adhering Machine Un-learning in NLP [66.17039929803933]
In real world industry use Machine Learning to build models on user data.
Such mandates require effort both in terms of data as well as model retraining.
continuous removal of data and model retraining steps do not scale.
We propose textitMachine Unlearning to tackle this challenge.
arXiv Detail & Related papers (2022-12-19T16:06:45Z) - Representation Learning for the Automatic Indexing of Sound Effects
Libraries [79.68916470119743]
We show that a task-specific but dataset-independent representation can successfully address data issues such as class imbalance, inconsistent class labels, and insufficient dataset size.
Detailed experimental results show the impact of metric learning approaches and different cross-dataset training methods on representational effectiveness.
arXiv Detail & Related papers (2022-08-18T23:46:13Z) - Challenges and Solutions to Build a Data Pipeline to Identify Anomalies
in Enterprise System Performance [3.037408957267527]
We discuss challenges to harness data to operate our ML-based anomaly detection system.
We demonstrate that by addressing these data challenges, we not only improve the accuracy of our performance anomaly detection model by 30%, but also ensure that the model performance to never degrade over time.
arXiv Detail & Related papers (2021-12-13T22:30:07Z) - Detecting Requirements Smells With Deep Learning: Experiences,
Challenges and Future Work [9.44316959798363]
This work aims to improve the previous work by creating a manually labeled dataset and using ensemble learning, Deep Learning (DL), and techniques such as word embeddings and transfer learning to overcome the generalization problem.
The current findings show that the dataset is unbalanced and which class examples should be added more.
arXiv Detail & Related papers (2021-08-06T12:45:15Z) - Towards Model-informed Precision Dosing with Expert-in-the-loop Machine
Learning [0.0]
We consider a ML framework that may accelerate model learning and improve its interpretability by incorporating human experts into the model learning loop.
We propose a novel human-in-the-loop ML framework aimed at dealing with learning problems that the cost of data annotation is high.
With an application to precision dosing, our experimental results show that the approach can learn interpretable rules from data and may potentially lower experts' workload.
arXiv Detail & Related papers (2021-06-28T03:45:09Z) - Technology Readiness Levels for Machine Learning Systems [107.56979560568232]
Development and deployment of machine learning systems can be executed easily with modern tools, but the process is typically rushed and means-to-an-end.
We have developed a proven systems engineering approach for machine learning development and deployment.
Our "Machine Learning Technology Readiness Levels" framework defines a principled process to ensure robust, reliable, and responsible systems.
arXiv Detail & Related papers (2021-01-11T15:54:48Z) - Towards CRISP-ML(Q): A Machine Learning Process Model with Quality
Assurance Methodology [53.063411515511056]
We propose a process model for the development of machine learning applications.
The first phase combines business and data understanding as data availability oftentimes affects the feasibility of the project.
The sixth phase covers state-of-the-art approaches for monitoring and maintenance of a machine learning applications.
arXiv Detail & Related papers (2020-03-11T08:25:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.