Assessing Dataset Quality Through Decision Tree Characteristics in
Autoencoder-Processed Spaces
- URL: http://arxiv.org/abs/2306.15392v1
- Date: Tue, 27 Jun 2023 11:33:31 GMT
- Title: Assessing Dataset Quality Through Decision Tree Characteristics in
Autoencoder-Processed Spaces
- Authors: Szymon Mazurek, Maciej Wielgosz
- Abstract summary: We show the profound impact of dataset quality on model training and performance.
Our findings underscore the importance of appropriate feature selection, adequate data volume, and data quality.
This research offers valuable insights into data assessment practices, contributing to the development of more accurate and robust machine learning models.
- Score: 0.30458514384586394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we delve into the critical aspect of dataset quality
assessment in machine learning classification tasks. Leveraging nine distinct
datasets, each crafted for classification tasks with varying
complexity levels, we illustrate the profound impact of dataset quality on
model training and performance. We further introduce two additional datasets
designed to represent specific data conditions - one maximizing entropy and the
other demonstrating high redundancy. Our findings underscore the importance of
appropriate feature selection, adequate data volume, and data quality in
achieving high-performing machine learning models. To aid researchers and
practitioners, we propose a comprehensive framework for dataset quality
assessment, which can help evaluate if the dataset at hand is sufficient and of
the required quality for specific tasks. This research offers valuable insights
into data assessment practices, contributing to the development of more
accurate and robust machine learning models.
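As a rough illustration of the workflow the title describes, the sketch below encodes a dataset with a small autoencoder and reads dataset-quality signals off a decision tree fit in the latent space. It assumes scikit-learn; the bottleneck width, the particular tree statistics, and the heuristic interpretation are illustrative choices, not the authors' exact pipeline.

```python
# Hedged sketch: encode a dataset with a small autoencoder, fit a decision
# tree in the latent space, and read the tree's characteristics as rough
# dataset-quality signals. Bottleneck width, tree statistics, and the
# heuristic reading are assumptions, not the paper's exact method.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

# Single-hidden-layer autoencoder: reconstruct the input through a bottleneck.
ae = MLPRegressor(hidden_layer_sizes=(8,), activation="relu",
                  max_iter=2000, random_state=0)
ae.fit(X, X)

# Encode by applying the first layer's weights and the ReLU manually.
Z = np.maximum(0.0, X @ ae.coefs_[0] + ae.intercepts_[0])

# Decision-tree characteristics in the autoencoder-processed space.
tree = DecisionTreeClassifier(random_state=0).fit(Z, y)
acc = cross_val_score(DecisionTreeClassifier(random_state=0), Z, y, cv=5).mean()
print(f"depth={tree.get_depth()} leaves={tree.get_n_leaves()} cv_acc={acc:.3f}")
# Heuristic reading: a shallow, accurate tree suggests well-structured,
# separable data; a deep tree with many leaves and low accuracy suggests
# noise or redundancy.
```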
Related papers
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface-form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
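The core move in the entry above can be sketched as follows: project per-example gradients into a low-dimensional space and rank training examples by similarity to a target (validation) gradient. The stand-in gradients, projection dimension, and 5% cutoff are assumptions for illustration; the paper's actual pipeline (e.g., which model gradients are used and how) may differ.

```python
# Illustrative sketch in the spirit of LESS: project per-example gradients
# to a low-rank space and rank training examples by cosine similarity to a
# validation gradient. Stand-in gradients, projection dimension, and the
# 5% cutoff are assumptions, not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10_000, 64, 500                     # param dim, projection dim, pool size
P = rng.standard_normal((d, k)) / np.sqrt(k)  # random low-rank projection

grads = rng.standard_normal((n, d))           # stand-in per-example gradients
val_grad = rng.standard_normal(d)             # stand-in validation gradient

G = grads @ P                                 # (n, k) projected gradients
v = val_grad @ P                              # (k,) projected target
cos = (G @ v) / (np.linalg.norm(G, axis=1) * np.linalg.norm(v) + 1e-12)

top5pct = np.argsort(cos)[::-1][: n // 20]    # keep the best-aligned 5%
print("selected indices:", top5pct[:10])
```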
- A Novel Metric for Measuring Data Quality in Classification Applications
(extended version) [0.0]
We introduce and explain a novel metric to measure data quality.
This metric is based on how classification performance evolves as the data progressively deteriorates.
We provide an interpretation of each criterion and examples of assessment levels.
arXiv Detail & Related papers (2023-12-13T11:20:09Z)
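A minimal sketch of the idea in the entry above: deteriorate the data step by step, track classifier accuracy, and measure how tightly performance tracks the deterioration. Label noise as the deterioration mechanism and Pearson correlation as the statistic are both illustrative assumptions, not the paper's definitions.

```python
# Minimal sketch: inject increasing label noise (assumed deterioration
# mechanism), track cross-validated accuracy, and correlate the two.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=800, n_features=20, random_state=0)
rng = np.random.default_rng(0)

levels = np.linspace(0.0, 0.5, 6)             # fraction of labels to flip
scores = []
for p in levels:
    y_noisy = y.copy()
    flip = rng.random(len(y)) < p             # flip roughly a fraction p of labels
    y_noisy[flip] = 1 - y_noisy[flip]
    scores.append(cross_val_score(LogisticRegression(max_iter=1000),
                                  X, y_noisy, cv=5).mean())

# A strong negative correlation means performance degrades in lockstep
# with the injected deterioration.
print(f"corr(deterioration, accuracy) = {np.corrcoef(levels, scores)[0, 1]:.3f}")
```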
- Data Diversity Matters for Robust Instruction Tuning [129.83575908023312]
Recent works have shown that by curating high quality and diverse instruction tuning datasets, we can significantly improve instruction-following capabilities.
We propose a new algorithm, Quality-Diversity Instruction Tuning (QDIT), to control dataset diversity and quality.
We validate the performance of QDIT on several large-scale instruction tuning datasets, where we find it can substantially improve worst-case and average-case performance.
arXiv Detail & Related papers (2023-11-21T19:12:18Z)
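One generic way to trade quality off against diversity, in the spirit of the entry above, is greedy selection under a weighted sum of a facility-location diversity gain and a per-example quality score. The embeddings, quality scores, and lambda weighting below are hypothetical; the paper's exact objective may differ.

```python
# Generic quality-diversity greedy selection. Embeddings, quality scores,
# and the lambda weighting are stand-ins, not the QDIT implementation.
import numpy as np

rng = np.random.default_rng(0)
n, k, lam = 200, 20, 0.5                      # pool size, budget, quality weight
emb = rng.standard_normal((n, 16))            # stand-in instruction embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
quality = rng.random(n)                       # stand-in per-example quality scores
sim = np.clip(emb @ emb.T, 0.0, None)         # nonnegative cosine similarities

selected, covered = [], np.zeros(n)
for _ in range(k):
    # Facility-location marginal gain (diversity) plus a weighted quality term.
    gain = np.maximum(sim - covered, 0.0).sum(axis=1)
    score = (1.0 - lam) * gain / n + lam * quality
    score[selected] = -np.inf                 # never re-pick an example
    j = int(np.argmax(score))
    selected.append(j)
    covered = np.maximum(covered, sim[j])
print("selected:", selected)
```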
- Exploring Dataset-Scale Indicators of Data Quality [23.017200605976807]
Modern computer vision foundation models are trained on massive amounts of data, incurring large economic and environmental costs.
Recent research has suggested that improving data quality can significantly reduce the need for data quantity.
We posit that the quality of a given dataset can be decomposed into distinct sample-level and dataset-level constituents.
arXiv Detail & Related papers (2023-11-07T14:14:32Z)
- A Data-centric Framework for Improving Domain-specific Machine Reading
Comprehension Datasets [5.673449249014538]
Low-quality data can cause downstream problems in high-stakes applications.
The data-centric approach emphasizes improving dataset quality to enhance model performance.
arXiv Detail & Related papers (2023-04-02T08:26:38Z)
- Striving for data-model efficiency: Identifying data externalities on
group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- A Proposal to Study "Is High Quality Data All We Need?" [8.122270502556374]
We propose an empirical study that examines how to select a subset of, and/or create, high-quality benchmark data.
We seek to answer if big datasets are truly needed to learn a task, and whether a smaller subset of high quality data can replace big datasets.
arXiv Detail & Related papers (2022-03-12T10:50:13Z)
- Representation Matters: Assessing the Importance of Subgroup Allocations
in Training Data [85.43008636875345]
We show that diverse representation in training data is key to increasing subgroup performances and achieving population level objectives.
Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.
arXiv Detail & Related papers (2021-03-05T00:27:08Z)
- Diverse Complexity Measures for Dataset Curation in Self-driving [80.55417232642124]
We propose a new data selection method that exploits a diverse set of criteria that quantify the interestingness of traffic scenes.
Our experiments show that the proposed curation pipeline is able to select datasets that lead to better generalization and higher performance.
arXiv Detail & Related papers (2021-01-16T23:45:02Z)
- Data Quality Measures and Efficient Evaluation Algorithms for
Large-Scale High-Dimensional Data [0.15229257192293197]
We propose two data quality measures that can compute class separability and in-class variability, two important aspects of data quality, for a given dataset.
We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping, with statistical benefits on large-scale high-dimensional data.
arXiv Detail & Related papers (2021-01-05T10:23:08Z)
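The two measures named in the entry above might be estimated along these lines, combining random 1-D projections with bootstrap resampling as the abstract describes. The Fisher-style separability ratio and pooled-variance spread are illustrative stand-ins for the paper's exact definitions.

```python
# Sketch: estimate class separability and in-class variability over many
# bootstrap resamples and random 1-D projections. The concrete statistics
# are assumptions, not the paper's definitions.
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
rng = np.random.default_rng(0)

seps, spreads = [], []
for _ in range(200):
    idx = rng.integers(0, len(X), len(X))      # bootstrap resample
    Xb, yb = X[idx], y[idx]
    w = rng.standard_normal(X.shape[1])        # random 1-D projection
    z = Xb @ w / np.linalg.norm(w)
    m0, m1 = z[yb == 0].mean(), z[yb == 1].mean()
    v0, v1 = z[yb == 0].var(), z[yb == 1].var()
    seps.append((m0 - m1) ** 2 / (v0 + v1 + 1e-12))  # Fisher-style separability
    spreads.append(0.5 * (v0 + v1))                  # in-class variability

print(f"class separability ~ {np.mean(seps):.3f}")
print(f"in-class variability ~ {np.mean(spreads):.3f}")
```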
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.