Ensuring Dataset Quality for Machine Learning Certification
- URL: http://arxiv.org/abs/2011.01799v1
- Date: Tue, 3 Nov 2020 15:45:43 GMT
- Title: Ensuring Dataset Quality for Machine Learning Certification
- Authors: Sylvaine Picard, Camille Chapdelaine, Cyril Cappi, Laurent Gardes,
Eric Jenn, Baptiste Lefèvre, Thomas Soumarmon
- Abstract summary: We show that existing data standards neither properly capture nor take into account the specificities of the Machine Learning context.
We propose a dataset specification and verification process, and apply it on a signal recognition system from the railway domain.
- Score: 0.6927055673104934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we address the problem of dataset quality in the context of
Machine Learning (ML)-based critical systems. We briefly analyse the
applicability of some existing standards dealing with data and show that the
specificities of the ML context are neither properly captured nor taken into
account. As a first answer to this concerning situation, we propose a dataset
specification and verification process, and apply it on a signal recognition
system from the railway domain. In addition, we also give a list of
recommendations for the collection and management of datasets. This work is one
step towards the dataset engineering process that will be required for ML to be
used on safety-critical systems.
Related papers
- How to design a dataset compliant with an ML-based system ODD? [5.432478272457867]
This paper focuses on a Vision-based Landing task and presents the design and validation of a dataset that would comply with the Operational Design Domain (ODD) of a Machine-Learning (ML) system.
Relying on emerging certification standards, we describe the process for establishing ODDs at both the system and image levels.
arXiv Detail & Related papers (2024-06-20T06:48:34Z) - A Systematic Review of Available Datasets in Additive Manufacturing [56.684125592242445]
In-situ monitoring incorporating visual and other sensor technologies allows the collection of extensive datasets during the Additive Manufacturing process.
These datasets have potential for determining the quality of the manufactured output and the detection of defects through the use of Machine Learning.
This systematic review investigates the availability of open image-based datasets originating from AM processes that align with a number of pre-defined selection criteria.
arXiv Detail & Related papers (2024-01-27T16:13:32Z) - ECS -- an Interactive Tool for Data Quality Assurance [63.379471124899915]
We present a novel approach for the assurance of data quality.
For this purpose, the mathematical basics are first discussed and the approach is presented using multiple examples.
This results in the detection of data points with potentially harmful properties for use in safety-critical systems.
arXiv Detail & Related papers (2023-07-10T06:49:18Z) - QI2 -- an Interactive Tool for Data Quality Assurance [63.379471124899915]
The planned AI Act from the European Commission defines challenging legal requirements for data quality.
We introduce a novel approach that supports the data quality assurance process of multiple data quality aspects.
arXiv Detail & Related papers (2023-07-07T07:06:38Z) - Quality In / Quality Out: Assessing Data quality in an Anomaly Detection Benchmark [0.13764085113103217]
We show that relatively minor modifications on the same benchmark dataset (UGR'16, a flow-based real-traffic dataset for anomaly detection) cause significantly more impact on model performance than the specific Machine Learning technique considered.
Our findings illustrate the need to devote more attention to (automatic) data quality assessment and optimization techniques in the context of autonomous networks.
arXiv Detail & Related papers (2023-05-31T12:03:12Z) - Benchmarking Automated Machine Learning Methods for Price Forecasting
Applications [58.720142291102135]
We show the possibility of substituting manually created ML pipelines with automated machine learning (AutoML) solutions.
Based on the CRISP-DM process, we split the manual ML pipeline into a machine learning and non-machine learning part.
In a case study on the industrial use case of price forecasting, we show that domain knowledge combined with AutoML can weaken the dependence on ML experts.
arXiv Detail & Related papers (2023-04-28T10:27:38Z) - Privacy Adhering Machine Un-learning in NLP [66.17039929803933]
Real-world industry applications use Machine Learning to build models on user data.
Mandates to remove user data require effort in terms of both data and model retraining.
Continuous removal of data and model retraining steps do not scale.
We propose Machine Unlearning to tackle this challenge.
arXiv Detail & Related papers (2022-12-19T16:06:45Z) - Training from Zero: Radio Frequency Machine Learning Data Quantity Forecasting [0.0]
The data used during training in any given application space is directly tied to the performance of the system once deployed.
One of the underlying rules of thumb used within the machine learning space is that more data leads to better models.
This work examines a modulation classification problem in the Radio Frequency domain space.
arXiv Detail & Related papers (2022-05-07T18:45:06Z) - Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise [21.491392581672198]
We present Snoopy, with the goal of supporting data scientists and machine learning engineers performing a systematic and theoretically founded feasibility study.
We approach this problem by estimating the irreducible error of the underlying task, also known as the Bayes error rate (BER).
We demonstrate in end-to-end experiments how users are able to save substantial labeling time and monetary efforts.
arXiv Detail & Related papers (2020-10-16T14:21:19Z) - Unsupervised Quality Estimation for Neural Machine Translation [63.38918378182266]
Existing approaches require large amounts of expert annotated data, computation and time for training.
We devise an unsupervised approach to QE where no training or access to additional resources besides the MT system itself is required.
We achieve very good correlation with human judgments of quality, rivalling state-of-the-art supervised QE models.
arXiv Detail & Related papers (2020-05-21T12:38:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.