Data Quality Measures and Efficient Evaluation Algorithms for
Large-Scale High-Dimensional Data
- URL: http://arxiv.org/abs/2101.01441v1
- Date: Tue, 5 Jan 2021 10:23:08 GMT
- Title: Data Quality Measures and Efficient Evaluation Algorithms for
Large-Scale High-Dimensional Data
- Authors: Hyeongmin Cho, Sangkyun Lee
- Abstract summary: We propose two data quality measures that can compute class separability and in-class variability, the two important aspects of data quality, for a given dataset.
We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping with statistical benefits on large-scale high-dimensional data.
- Score: 0.15229257192293197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning has been proven to be effective in various application
areas, such as object and speech recognition on mobile systems. Since a
critical key to machine learning success is the availability of large training
data, many datasets are being disclosed and published online. From a data
consumer or manager point of view, measuring data quality is an important first
step in the learning process. We need to determine which datasets to use,
update, and maintain. However, not many practical ways to measure data quality
are available today, especially when it comes to large-scale high-dimensional
data, such as images and videos. This paper proposes two data quality measures
that can compute class separability and in-class variability, the two important
aspects of data quality, for a given dataset. Classical data quality measures
tend to focus only on class separability; however, we suggest that in-class
variability is another important data quality factor. We provide efficient
algorithms to compute our quality measures based on random projections and
bootstrapping with statistical benefits on large-scale high-dimensional data.
In experiments, we show that our measures are compatible with classical
measures on small-scale data and can be computed much more efficiently on
large-scale high-dimensional datasets.
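The abstract's recipe of random projections plus bootstrapping can be sketched as follows. This is an illustrative example only, not the paper's actual measures: the separability proxy here (between-class vs. within-class scatter, in the spirit of classical separability measures) and all parameter names are assumptions, but the structure shows why the approach scales — each bootstrap replicate works on a subsample projected into a low-dimensional space.

```python
import numpy as np

def separability_estimate(X, y, proj_dim=32, n_boot=20, sample_frac=0.3, seed=0):
    """Bootstrap estimate of a simple class-separability score after
    random projection. Illustrative sketch only: the paper's measures
    differ; this shows the random-projection + bootstrapping recipe."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    scores = []
    for _ in range(n_boot):
        # Bootstrap: subsample rows with replacement.
        idx = rng.choice(n, size=int(sample_frac * n), replace=True)
        Xs, ys = X[idx], y[idx]
        # Random projection to a low-dimensional space
        # (Johnson-Lindenstrauss style Gaussian projection).
        R = rng.standard_normal((d, proj_dim)) / np.sqrt(proj_dim)
        Z = Xs @ R
        # Separability proxy: between-class vs. within-class scatter.
        mu = Z.mean(axis=0)
        between, within = 0.0, 0.0
        for c in np.unique(ys):
            Zc = Z[ys == c]
            mu_c = Zc.mean(axis=0)
            between += len(Zc) * np.sum((mu_c - mu) ** 2)
            within += np.sum((Zc - mu_c) ** 2)
        scores.append(between / max(within, 1e-12))
    # Bootstrapping yields both a point estimate and an uncertainty estimate.
    return float(np.mean(scores)), float(np.std(scores))
```

On well-separated classes this score is high; shuffling the labels collapses the between-class term, so the score drops — the basic sanity check any separability measure should pass.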
Related papers
- Scaling Laws for the Value of Individual Data Points in Machine Learning [55.596413470429475]
We introduce a new perspective by investigating scaling behavior for the value of individual data points.
We provide learning theory to support our scaling law, and we observe empirically that it holds across diverse model classes.
Our work represents a first step towards understanding and utilizing scaling properties for the value of individual data points.
arXiv Detail & Related papers (2024-05-30T20:10:24Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exercises the reasoning skills needed for the intended downstream application.

arXiv Detail & Related papers (2024-02-06T19:18:04Z) - Exploring Dataset-Scale Indicators of Data Quality [23.017200605976807]
Modern computer vision foundation models are trained on massive amounts of data, incurring large economic and environmental costs.
Recent research has suggested that improving data quality can significantly reduce the need for data quantity.
We posit that the quality of a given dataset can be decomposed into distinct sample-level and dataset-level constituents.
arXiv Detail & Related papers (2023-11-07T14:14:32Z) - ECS -- an Interactive Tool for Data Quality Assurance [63.379471124899915]
We present a novel approach for the assurance of data quality.
For this purpose, the mathematical basics are first discussed and the approach is presented using multiple examples.
This results in the detection of data points with properties potentially harmful for use in safety-critical systems.
arXiv Detail & Related papers (2023-07-10T06:49:18Z) - QI2 -- an Interactive Tool for Data Quality Assurance [63.379471124899915]
The planned AI Act from the European commission defines challenging legal requirements for data quality.
We introduce a novel approach that supports the data quality assurance process of multiple data quality aspects.
arXiv Detail & Related papers (2023-07-07T07:06:38Z) - Assessing Dataset Quality Through Decision Tree Characteristics in
Autoencoder-Processed Spaces [0.30458514384586394]
We show the profound impact of dataset quality on model training and performance.
Our findings underscore the importance of appropriate feature selection, adequate data volume, and data quality.
This research offers valuable insights into data assessment practices, contributing to the development of more accurate and robust machine learning models.
arXiv Detail & Related papers (2023-06-27T11:33:31Z) - How Much More Data Do I Need? Estimating Requirements for Downstream
Tasks [99.44608160188905]
Given a small training data set and a learning algorithm, how much more data is necessary to reach a target validation or test performance?
Overestimating or underestimating data requirements incurs substantial costs that could be avoided with an adequate budget.
Using our guidelines, practitioners can accurately estimate data requirements of machine learning systems to gain savings in both development time and data acquisition costs.
arXiv Detail & Related papers (2022-07-04T21:16:05Z) - Homogenization of Existing Inertial-Based Datasets to Support Human
Activity Recognition [8.076841611508486]
Several techniques have been proposed to address the problem of recognizing activities of daily living from signals.
Deep learning techniques applied to inertial signals have proven to be effective, achieving significant classification accuracy.
Research in human activity recognition models has been almost entirely model-centric.
arXiv Detail & Related papers (2022-01-17T14:29:48Z) - Diverse Complexity Measures for Dataset Curation in Self-driving [80.55417232642124]
We propose a new data selection method that exploits a diverse set of criteria that quantify the interestingness of traffic scenes.
Our experiments show that the proposed curation pipeline is able to select datasets that lead to better generalization and higher performance.
arXiv Detail & Related papers (2021-01-16T23:45:02Z) - On the Use of Interpretable Machine Learning for the Management of Data
Quality [13.075880857448059]
We propose the use of interpretable machine learning to identify the features that any data processing activity should rely on.
Our aim is to secure data quality, at least for those features detected as significant in the collected datasets.
arXiv Detail & Related papers (2020-07-29T08:49:32Z) - What is the Value of Data? On Mathematical Methods for Data Quality
Estimation [35.75162309592681]
We propose a formal definition for the quality of a given dataset.
We assess a dataset's quality by a quantity we call the expected diameter.
arXiv Detail & Related papers (2020-01-09T18:56:48Z)
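One natural reading of the "expected diameter" idea above is the expected distance between two points drawn at random from the dataset, which can be estimated by Monte Carlo sampling. This sketch is a hypothetical interpretation for illustration, not the paper's formal definition; the function name and parameters are assumptions.

```python
import numpy as np

def expected_diameter(X, n_pairs=1000, seed=0):
    """Monte Carlo estimate of the expected Euclidean distance between
    two points drawn uniformly (with replacement) from the dataset.
    A hypothetical sketch of the 'expected diameter' notion."""
    rng = np.random.default_rng(seed)
    # Sample index pairs uniformly at random.
    i = rng.integers(0, len(X), size=n_pairs)
    j = rng.integers(0, len(X), size=n_pairs)
    # Average pairwise distance over the sampled pairs.
    return float(np.mean(np.linalg.norm(X[i] - X[j], axis=1)))
```

Under this reading, a tightly clustered dataset has a small expected diameter, while a widely spread one has a large value, which is why such a quantity can serve as a crude data-quality signal.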
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.