EcoVal: An Efficient Data Valuation Framework for Machine Learning
- URL: http://arxiv.org/abs/2402.09288v5
- Date: Tue, 9 Jul 2024 08:59:44 GMT
- Title: EcoVal: An Efficient Data Valuation Framework for Machine Learning
- Authors: Ayush K Tarun, Vikram S Chundawat, Murari Mandal, Hong Ming Tan, Bowei Chen, Mohan Kankanhalli,
- Abstract summary: Existing Shapley value based frameworks for data valuation in machine learning are computationally expensive.
We introduce an efficient data valuation framework EcoVal, to estimate the value of data for machine learning models in a fast and practical manner.
- Score: 11.685518953430554
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Quantifying the value of data within a machine learning workflow can play a pivotal role in making more strategic decisions in machine learning initiatives. The existing Shapley value based frameworks for data valuation in machine learning are computationally expensive as they require considerable amount of repeated training of the model to obtain the Shapley value. In this paper, we introduce an efficient data valuation framework EcoVal, to estimate the value of data for machine learning models in a fast and practical manner. Instead of directly working with individual data sample, we determine the value of a cluster of similar data points. This value is further propagated amongst all the member cluster points. We show that the overall value of the data can be determined by estimating the intrinsic and extrinsic value of each data. This is enabled by formulating the performance of a model as a \textit{production function}, a concept which is popularly used to estimate the amount of output based on factors like labor and capital in a traditional free economic market. We provide a formal proof of our valuation technique and elucidate the principles and mechanisms that enable its accelerated performance. We demonstrate the real-world applicability of our method by showcasing its effectiveness for both in-distribution and out-of-sample data. This work addresses one of the core challenges of efficient data valuation at scale in machine learning models. The code is available at \underline{https://github.com/respai-lab/ecoval}.
Related papers
- Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing a small "forget set" training data on a pre-divertrained machine learning model -- has recently attracted interest.
Recent research shows that machine unlearning techniques do not hold up in such a challenging setting.
arXiv Detail & Related papers (2024-10-30T17:20:10Z) - Reframing Data Value for Large Language Models Through the Lens of Plausibility [6.697702130929693]
We propose an alternative perspective on the data value problem for language models.
We develop a novel value function that is computationally tractable and derived from first principles with provable properties.
arXiv Detail & Related papers (2024-08-30T22:32:24Z) - CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning [0.0]
We propose CHG Shapley, which approximates the utility of each data subset on model accuracy during a single model training.
We employ CHG Shapley for real-time data selection, demonstrating its effectiveness in identifying high-value and noisy data.
arXiv Detail & Related papers (2024-06-17T16:48:31Z) - Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z) - Is Data Valuation Learnable and Interpretable? [3.9325957466009203]
Current data valuation methods ignore the interpretability of the output values.
This study aims to answer an important question: is data valuation learnable and interpretable?
arXiv Detail & Related papers (2024-06-03T08:13:47Z) - Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z) - Shapley Value on Probabilistic Classifiers [6.163093930860032]
In the context of machine learning (ML), data valuation methods aim to equitably measure the contribution of each data point to the utility of an ML model.
Traditional Shapley-based data valuation methods may not effectively distinguish between beneficial and detrimental training data points.
We propose Probabilistic Shapley (P-Shapley) value by constructing a probability-wise utility function.
arXiv Detail & Related papers (2023-06-12T15:09:13Z) - Striving for data-model efficiency: Identifying data externalities on
group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z) - Data Banzhaf: A Robust Data Valuation Framework for Machine Learning [18.65808473565554]
This paper studies the robustness of data valuation to noisy model performance scores.
We introduce the concept of safety margin, which measures the robustness of a data value notion.
We show that the Banzhaf value achieves the largest safety margin among all semivalues.
arXiv Detail & Related papers (2022-05-30T23:44:09Z) - Learning to be a Statistician: Learned Estimator for Number of Distinct
Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z) - ALT-MAS: A Data-Efficient Framework for Active Testing of Machine
Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using Bayesian neural network (BNN)
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.