OpenDataVal: a Unified Benchmark for Data Valuation
- URL: http://arxiv.org/abs/2306.10577v3
- Date: Fri, 13 Oct 2023 04:05:07 GMT
- Title: OpenDataVal: a Unified Benchmark for Data Valuation
- Authors: Kevin Fu Jiang, Weixin Liang, James Zou, Yongchan Kwon
- Abstract summary: We introduce OpenDataVal, an easy-to-use and unified benchmark framework for data valuation.
OpenDataVal provides an integrated environment that includes eleven different state-of-the-art data valuation algorithms.
We perform benchmarking analysis using OpenDataVal, quantifying and comparing the efficacy of state-of-the-art data valuation approaches.
- Score: 38.15852021170501
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Assessing the quality and impact of individual data points is critical for improving model performance and mitigating undesirable biases within the training dataset. Several data valuation algorithms have been proposed to quantify data quality; however, there is no systematic and standardized benchmark for data valuation. In this paper, we introduce OpenDataVal, an easy-to-use and unified benchmark framework that empowers researchers and practitioners to apply and compare various data valuation algorithms. OpenDataVal provides an integrated environment that includes (i) a diverse collection of image, natural language, and tabular datasets, (ii) implementations of eleven different state-of-the-art data valuation algorithms, and (iii) a prediction model API that can import any model from scikit-learn. Furthermore, we propose four downstream machine learning tasks for evaluating the quality of data values. We perform a benchmarking analysis using OpenDataVal, quantifying and comparing the efficacy of state-of-the-art data valuation approaches. We find that no single algorithm performs uniformly best across all tasks, so the appropriate algorithm should be chosen based on the user's downstream task. OpenDataVal is publicly available at https://opendataval.github.io with comprehensive documentation. Furthermore, we provide a leaderboard where researchers can evaluate the effectiveness of their own data valuation algorithms.
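To make concrete the quantity that data valuation algorithms estimate, here is a minimal leave-one-out (LOO) sketch built on scikit-learn, the model library the OpenDataVal prediction API targets. It is an illustration of the idea, not one of the eleven algorithms bundled with OpenDataVal; the dataset, model, and utility choices are arbitrary assumptions.

```python
# Leave-one-out (LOO) data values: a point's value is the drop in validation
# accuracy when it is removed from the training set. Illustrative only; not
# a reimplementation of any OpenDataVal algorithm.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

def utility(X, y):
    """Validation accuracy of a model trained on (X, y)."""
    return LogisticRegression(max_iter=1000).fit(X, y).score(X_val, y_val)

base = utility(X_tr, y_tr)
loo_values = np.array([
    base - utility(np.delete(X_tr, i, axis=0), np.delete(y_tr, i))
    for i in range(len(X_tr))
])
print(loo_values[:5])   # positive: helpful point; negative: harmful point
```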
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which degrades training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
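The abstract does not spell out the selection criterion, but a common building block of embedding-based selection is keeping cluster representatives in a feature space. A minimal sketch under that assumption, with random placeholders standing in for CLIP embeddings:

```python
# Representative-sample selection: cluster the embeddings and keep the sample
# nearest each centroid. The embeddings are random placeholders, not real
# CLIP features, and this is not the paper's actual multimodal criterion.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 512))   # placeholder for CLIP image embeddings
budget = 50                          # number of samples to keep
km = KMeans(n_clusters=budget, n_init=10, random_state=0).fit(emb)
nearest, _ = pairwise_distances_argmin_min(km.cluster_centers_, emb)
selected = np.unique(nearest)        # one representative per cluster
print(len(selected), "samples selected")
```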
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Data Distribution Valuation [56.71023681599737]
Existing data valuation methods define a value for a discrete dataset.
In many use cases, users are interested in not only the value of the dataset, but that of the distribution from which the dataset was sampled.
We propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies.
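For reference, the (biased) empirical MMD between samples X and Y under kernel k is MMD^2(X, Y) = mean(k(X, X)) + mean(k(Y, Y)) - 2 * mean(k(X, Y)). A minimal sketch with an RBF kernel; the kernel and bandwidth are illustrative assumptions, not the paper's exact estimator or policy construction:

```python
# Empirical (biased) MMD^2 between two samples with an RBF kernel.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd2(X, Y, gamma=1.0):
    return (rbf_kernel(X, X, gamma=gamma).mean()
            + rbf_kernel(Y, Y, gamma=gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma=gamma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 5))   # reference sample
Y = rng.normal(0.5, 1.0, size=(500, 5))   # shifted "vendor" sample
print(mmd2(X, X[:250]), mmd2(X, Y))       # near zero vs. clearly larger
```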
arXiv Detail & Related papers (2024-10-06T07:56:53Z)
- Data Valuation by Leveraging Global and Local Statistical Information [25.911043100052588]
We show that both global and local value distributions hold significant potential for data valuation within the context of machine learning.
We propose a new data valuation method that estimates Shapley values by incorporating the explored distribution characteristics into an existing method, AME.
We also present a new path to address the dynamic data valuation problem by formulating an optimization problem that integrates information of both global and local value distributions.
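The paper's AME-based refinement is not reproducible from the abstract, but the classical Monte Carlo permutation estimator of data Shapley values that such methods build on can be sketched: sample permutations of the training set and average each point's marginal contribution to a validation utility.

```python
# Monte Carlo permutation estimate of data Shapley values -- the classical
# baseline; the paper's use of global/local value distributions and AME is
# not reproduced here.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=120, n_features=8, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=1)
n = len(X_tr)

def utility(idx):
    if len(np.unique(y_tr[idx])) < 2:   # cannot fit on a single class
        return 0.0
    return LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx]).score(X_val, y_val)

rng = np.random.default_rng(1)
T = 20                                   # number of sampled permutations
values = np.zeros(n)
for _ in range(T):
    perm = rng.permutation(n)
    prev = 0.0
    for k in range(1, n + 1):
        cur = utility(perm[:k])
        values[perm[k - 1]] += (cur - prev) / T   # marginal contribution
        prev = cur
print(values[:5])
```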
arXiv Detail & Related papers (2024-05-23T08:58:08Z)
- Data Valuation with Gradient Similarity [1.997283751398032]
Data Valuation algorithms quantify the value of each sample in a dataset based on its contribution or importance to a given predictive task.
We present a simple alternative to existing methods, termed Data Valuation with Gradient Similarity (DVGS).
Our approach has the ability to rapidly and accurately identify low-quality data, which can reduce the need for expert knowledge and manual intervention in data cleaning tasks.
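A minimal sketch of gradient-similarity scoring under illustrative assumptions: logistic regression (without intercept) is used because its per-sample loss gradient has the closed form (sigmoid(x.w) - y) * x, and a sample's value is taken as the cosine similarity between its gradient and the mean validation gradient. The paper's exact training procedure is not reproduced.

```python
# Gradient-similarity data values: samples whose loss gradient aligns with
# the validation-set gradient score high.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=2)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=2)

w = LogisticRegression(max_iter=1000, fit_intercept=False).fit(X_tr, y_tr).coef_.ravel()

def per_sample_grads(X, y):
    p = 1.0 / (1.0 + np.exp(-X @ w))     # predicted probabilities
    return (p - y)[:, None] * X          # one gradient row per sample

g_val = per_sample_grads(X_val, y_val).mean(axis=0)   # validation gradient
g_tr = per_sample_grads(X_tr, y_tr)
values = g_tr @ g_val / (np.linalg.norm(g_tr, axis=1) * np.linalg.norm(g_val) + 1e-12)
print(values[:5])   # higher similarity ~ more valuable sample
```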
arXiv Detail & Related papers (2024-05-13T22:10:00Z)
- Neural Dynamic Data Valuation [4.286118155737111]
We propose a novel data valuation method from the perspective of optimal control, named Neural Dynamic Data Valuation (NDDV).
Our method has a solid theoretical interpretation, identifying data values accurately via the sensitivity of the data optimal control state.
In addition, we implement a data re-weighting strategy to capture the unique features of data points, ensuring fairness through the interaction between data points and the mean-field states.
arXiv Detail & Related papers (2024-04-30T13:39:26Z)
- How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
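As a rough sketch of what a density-based sampler can look like (the abstract specifies neither Ask-LLM's prompting nor the exact Density estimator), the following scores examples with a kernel density estimate over placeholder embeddings and samples inversely to density, so rare examples are preferentially kept:

```python
# Inverse-density sampling over placeholder text embeddings; illustrative
# assumption, not the paper's estimator.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(3)
emb = rng.normal(size=(2000, 16))   # placeholder for text embeddings
log_density = KernelDensity(bandwidth=1.0).fit(emb).score_samples(emb)
p = np.exp(-log_density)            # inverse-density sampling weights
p /= p.sum()
keep = rng.choice(len(emb), size=500, replace=False, p=p)
print(keep[:10])
```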
arXiv Detail & Related papers (2024-02-15T02:27:57Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets [71.54954966652286]
We evaluate Vision-Language Instruction-Tuning (VLIT) datasets.
We build a new dataset, REVO-LION, by collecting samples with higher SQ from each dataset.
Remarkably, even with only half of the complete data, the model trained on REVO-LION achieves performance comparable to simply combining all the VLIT datasets.
arXiv Detail & Related papers (2023-10-10T13:01:38Z)
- Exploring Data Redundancy in Real-world Image Classification through Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
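A sketch of the score-then-group selection pattern described above, with random stand-ins for both the features and the gradient-norm scores (the Synaptic Intelligence metric and real image features are not reproduced): cluster the feature space and keep the top-scoring samples in each cluster to preserve coverage.

```python
# Score-then-group data selection: cluster features, keep the highest-valued
# samples per cluster. Features and scores are random stand-ins.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
feats = rng.normal(size=(1000, 32))     # stand-in for image features
scores = rng.gamma(2.0, size=1000)      # stand-in for per-sample gradient norms

k, per_cluster = 20, 10                 # cluster count and per-cluster budget
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(feats)
keep = np.concatenate([
    np.where(labels == c)[0][np.argsort(scores[labels == c])[-per_cluster:]]
    for c in range(k)
])
print(len(keep), "samples kept for training")
```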
arXiv Detail & Related papers (2023-06-25T03:31:05Z)
- LAVA: Data Valuation without Pre-Specified Learning Algorithms [20.578106028270607]
We introduce a new framework that can value training data in a way that is oblivious to the downstream learning algorithm.
We develop a proxy for the validation performance associated with a training set based on a non-conventional class-wise Wasserstein distance between training and validation sets.
We show that the distance characterizes the upper bound of the validation performance for any given model under certain Lipschitz conditions.
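LAVA's non-conventional class-wise Wasserstein distance is not reproducible from the abstract alone; as a simplified 1-D stand-in, one can sum per-feature, per-class Wasserstein distances between the training and validation sets with scipy:

```python
# Simplified class-wise distance: sum of 1-D Wasserstein distances per
# feature and per class. A stand-in to make the idea concrete, not LAVA's
# actual optimal-transport formulation.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(5)
X_tr = rng.normal(0.0, 1.0, size=(500, 4)); y_tr = rng.integers(0, 2, 500)
X_val = rng.normal(0.2, 1.0, size=(200, 4)); y_val = rng.integers(0, 2, 200)

dist = sum(
    wasserstein_distance(X_tr[y_tr == c][:, j], X_val[y_val == c][:, j])
    for c in (0, 1)
    for j in range(X_tr.shape[1])
)
print("class-wise 1-D Wasserstein proxy:", dist)
```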
arXiv Detail & Related papers (2023-04-28T19:05:16Z)