Assessing Dataset Quality Through Decision Tree Characteristics in
Autoencoder-Processed Spaces
- URL: http://arxiv.org/abs/2306.15392v1
- Date: Tue, 27 Jun 2023 11:33:31 GMT
- Title: Assessing Dataset Quality Through Decision Tree Characteristics in
Autoencoder-Processed Spaces
- Authors: Szymon Mazurek, Maciej Wielgosz
- Abstract summary: We show the profound impact of dataset quality on model training and performance.
Our findings underscore the importance of appropriate feature selection, adequate data volume, and data quality.
This research offers valuable insights into data assessment practices, contributing to the development of more accurate and robust machine learning models.
- Score: 0.30458514384586394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we delve into the critical aspect of dataset quality
assessment in machine learning classification tasks. Leveraging nine distinct
datasets, each crafted for classification tasks with varying
complexity levels, we illustrate the profound impact of dataset quality on
model training and performance. We further introduce two additional datasets
designed to represent specific data conditions - one maximizing entropy and the
other demonstrating high redundancy. Our findings underscore the importance of
appropriate feature selection, adequate data volume, and data quality in
achieving high-performing machine learning models. To aid researchers and
practitioners, we propose a comprehensive framework for dataset quality
assessment, which can help evaluate if the dataset at hand is sufficient and of
the required quality for specific tasks. This research offers valuable insights
into data assessment practices, contributing to the development of more
accurate and robust machine learning models.
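As a rough illustration of the workflow the title describes, the sketch below encodes a dataset with a small autoencoder and reads dataset-quality signals off a decision tree fit in the latent space. It assumes scikit-learn; the bottleneck width, the particular tree statistics, and the heuristic interpretation are illustrative choices, not the authors' exact pipeline.

```python
# Hedged sketch: encode a dataset with a small autoencoder, fit a decision
# tree in the latent space, and read the tree's characteristics as rough
# dataset-quality signals. Bottleneck width, tree statistics, and the
# heuristic reading are assumptions, not the paper's exact method.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

# Single-hidden-layer autoencoder: reconstruct the input through a bottleneck.
ae = MLPRegressor(hidden_layer_sizes=(8,), activation="relu",
                  max_iter=2000, random_state=0)
ae.fit(X, X)

# Encode by applying the first layer's weights and the ReLU manually.
Z = np.maximum(0.0, X @ ae.coefs_[0] + ae.intercepts_[0])

# Decision-tree characteristics in the autoencoder-processed space.
tree = DecisionTreeClassifier(random_state=0).fit(Z, y)
acc = cross_val_score(DecisionTreeClassifier(random_state=0), Z, y, cv=5).mean()
print(f"depth={tree.get_depth()} leaves={tree.get_n_leaves()} cv_acc={acc:.3f}")
# Heuristic reading: a shallow, accurate tree suggests well-structured,
# separable data; a deep tree with many leaves and low accuracy suggests
# noise or redundancy.
```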
Related papers
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface-form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
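The core move in the entry above can be sketched as follows: project per-example gradients into a low-dimensional space and rank training examples by similarity to a target (validation) gradient. The stand-in gradients, projection dimension, and 5% cutoff are assumptions for illustration; the paper's actual pipeline (e.g., which model gradients are used and how) may differ.

```python
# Illustrative sketch in the spirit of LESS: project per-example gradients
# to a low-rank space and rank training examples by cosine similarity to a
# validation gradient. Stand-in gradients, projection dimension, and the
# 5% cutoff are assumptions, not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10_000, 64, 500                     # param dim, projection dim, pool size
P = rng.standard_normal((d, k)) / np.sqrt(k)  # random low-rank projection

grads = rng.standard_normal((n, d))           # stand-in per-example gradients
val_grad = rng.standard_normal(d)             # stand-in validation gradient

G = grads @ P                                 # (n, k) projected gradients
v = val_grad @ P                              # (k,) projected target
cos = (G @ v) / (np.linalg.norm(G, axis=1) * np.linalg.norm(v) + 1e-12)

top5pct = np.argsort(cos)[::-1][: n // 20]    # keep the best-aligned 5%
print("selected indices:", top5pct[:10])
```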
- A Novel Metric for Measuring Data Quality in Classification Applications
(extended version) [0.0]
We introduce and explain a novel metric to measure data quality.
This metric is based on how classification performance evolves as the data progressively deteriorates.
We provide an interpretation of each criterion and examples of assessment levels.
arXiv Detail & Related papers (2023-12-13T11:20:09Z)
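A minimal sketch of the idea in the entry above: deteriorate the data step by step, track classifier accuracy, and measure how tightly performance tracks the deterioration. Label noise as the deterioration mechanism and Pearson correlation as the statistic are both illustrative assumptions, not the paper's definitions.

```python
# Minimal sketch: inject increasing label noise (assumed deterioration
# mechanism), track cross-validated accuracy, and correlate the two.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=800, n_features=20, random_state=0)
rng = np.random.default_rng(0)

levels = np.linspace(0.0, 0.5, 6)             # fraction of labels to flip
scores = []
for p in levels:
    y_noisy = y.copy()
    flip = rng.random(len(y)) < p             # flip roughly a fraction p of labels
    y_noisy[flip] = 1 - y_noisy[flip]
    scores.append(cross_val_score(LogisticRegression(max_iter=1000),
                                  X, y_noisy, cv=5).mean())

# A strong negative correlation means performance degrades in lockstep
# with the injected deterioration.
print(f"corr(deterioration, accuracy) = {np.corrcoef(levels, scores)[0, 1]:.3f}")
```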
- Data Diversity Matters for Robust Instruction Tuning [129.83575908023312]
Recent works have shown that by curating high quality and diverse instruction tuning datasets, we can significantly improve instruction-following capabilities.
We propose a new algorithm, Quality-Diversity Instruction Tuning (QDIT), to control dataset diversity and quality.
We validate the performance of QDIT on several large-scale instruction tuning datasets, where we find it can substantially improve worst-case and average-case performance.
arXiv Detail & Related papers (2023-11-21T19:12:18Z)
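One generic way to trade quality off against diversity, in the spirit of the entry above, is greedy selection under a weighted sum of a facility-location diversity gain and a per-example quality score. The embeddings, quality scores, and lambda weighting below are hypothetical; the paper's exact objective may differ.

```python
# Generic quality-diversity greedy selection. Embeddings, quality scores,
# and the lambda weighting are stand-ins, not the QDIT implementation.
import numpy as np

rng = np.random.default_rng(0)
n, k, lam = 200, 20, 0.5                      # pool size, budget, quality weight
emb = rng.standard_normal((n, 16))            # stand-in instruction embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
quality = rng.random(n)                       # stand-in per-example quality scores
sim = np.clip(emb @ emb.T, 0.0, None)         # nonnegative cosine similarities

selected, covered = [], np.zeros(n)
for _ in range(k):
    # Facility-location marginal gain (diversity) plus a weighted quality term.
    gain = np.maximum(sim - covered, 0.0).sum(axis=1)
    score = (1.0 - lam) * gain / n + lam * quality
    score[selected] = -np.inf                 # never re-pick an example
    j = int(np.argmax(score))
    selected.append(j)
    covered = np.maximum(covered, sim[j])
print("selected:", selected)
```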
- Exploring Dataset-Scale Indicators of Data Quality [23.017200605976807]
Modern computer vision foundation models are trained on massive amounts of data, incurring large economic and environmental costs.
Recent research has suggested that improving data quality can significantly reduce the need for data quantity.
We posit that the quality of a given dataset can be decomposed into distinct sample-level and dataset-level constituents.
arXiv Detail & Related papers (2023-11-07T14:14:32Z)
- A Data-centric Framework for Improving Domain-specific Machine Reading
Comprehension Datasets [5.673449249014538]
Low-quality data can cause downstream problems in high-stakes applications.
The data-centric approach emphasizes improving dataset quality to enhance model performance.
arXiv Detail & Related papers (2023-04-02T08:26:38Z)
- Striving for data-model efficiency: Identifying data externalities on
group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- A Proposal to Study "Is High Quality Data All We Need?" [8.122270502556374]
We propose an empirical study that examines how to select a subset of, and/or create, high-quality benchmark data.
We seek to answer if big datasets are truly needed to learn a task, and whether a smaller subset of high quality data can replace big datasets.
arXiv Detail & Related papers (2022-03-12T10:50:13Z)
- Representation Matters: Assessing the Importance of Subgroup Allocations
in Training Data [85.43008636875345]
We show that diverse representation in training data is key to increasing subgroup performances and achieving population level objectives.
Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.
arXiv Detail & Related papers (2021-03-05T00:27:08Z)
- Diverse Complexity Measures for Dataset Curation in Self-driving [80.55417232642124]
We propose a new data selection method that exploits a diverse set of criteria that quantify the interestingness of traffic scenes.
Our experiments show that the proposed curation pipeline is able to select datasets that lead to better generalization and higher performance.
arXiv Detail & Related papers (2021-01-16T23:45:02Z)
- Data Quality Measures and Efficient Evaluation Algorithms for
Large-Scale High-Dimensional Data [0.15229257192293197]
We propose two data quality measures that can compute class separability and in-class variability, two important aspects of data quality, for a given dataset.
We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping, with statistical benefits on large-scale high-dimensional data.
arXiv Detail & Related papers (2021-01-05T10:23:08Z)
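The two measures named in the entry above might be estimated along these lines, combining random 1-D projections with bootstrap resampling as the abstract describes. The Fisher-style separability ratio and pooled-variance spread are illustrative stand-ins for the paper's exact definitions.

```python
# Sketch: estimate class separability and in-class variability over many
# bootstrap resamples and random 1-D projections. The concrete statistics
# are assumptions, not the paper's definitions.
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
rng = np.random.default_rng(0)

seps, spreads = [], []
for _ in range(200):
    idx = rng.integers(0, len(X), len(X))      # bootstrap resample
    Xb, yb = X[idx], y[idx]
    w = rng.standard_normal(X.shape[1])        # random 1-D projection
    z = Xb @ w / np.linalg.norm(w)
    m0, m1 = z[yb == 0].mean(), z[yb == 1].mean()
    v0, v1 = z[yb == 0].var(), z[yb == 1].var()
    seps.append((m0 - m1) ** 2 / (v0 + v1 + 1e-12))  # Fisher-style separability
    spreads.append(0.5 * (v0 + v1))                  # in-class variability

print(f"class separability ~ {np.mean(seps):.3f}")
print(f"in-class variability ~ {np.mean(spreads):.3f}")
```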
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.