Proper Dataset Valuation by Pointwise Mutual Information
- URL: http://arxiv.org/abs/2405.18253v2
- Date: Wed, 12 Feb 2025 06:47:33 GMT
- Title: Proper Dataset Valuation by Pointwise Mutual Information
- Authors: Shuran Zheng, Xuan Qi, Rui Ray Chen, Yongchan Kwon, James Zou
- Abstract summary: We propose an information-theoretic framework for evaluating data curation methods. We compare informativeness by the Shannon mutual information of the evaluated data and the test data. Experiments on real-world data demonstrate that our mutual information-based evaluation assigns appropriately lower scores to data curation strategies that reduce dataset informativeness.
- Score: 26.693741797887643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data plays a central role in the development of modern artificial intelligence, with high-quality data emerging as a key driver of model performance. This has prompted the development of various data curation methods in recent years. However, measuring the effectiveness of these data curation techniques remains a major challenge. Traditional evaluation methods, which assess a trained model's performance on specific benchmarks, risk promoting practices that merely make the data more similar to the test data. This issue exemplifies Goodhart's law: when a measure becomes a target, it ceases to be a good measure. To address this, we propose an information-theoretic framework for evaluating data curation methods, where dataset quality is measured by its informativeness about the true model parameters using the Blackwell ordering. We compare informativeness by the Shannon mutual information of the evaluated data and the test data, and we propose a novel method for estimating the mutual information of datasets by training Bayesian models on embedded data and computing the mutual information from the model's parameter posteriors. Experiments on real-world data demonstrate that our mutual information-based evaluation assigns appropriately lower scores to data curation strategies that reduce dataset informativeness, while traditional test score-based evaluation methods may favor data curation strategies that overfit to the test set but compromise the training data's informativeness.
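As a concrete illustration of the estimator described in the abstract, here is a minimal sketch assuming a conjugate Bayesian linear regression on fixed embeddings, where every parameter posterior is Gaussian and the mutual information between two conditionally independent datasets reduces to log-determinants of posterior covariances. The model choice and this closed form are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def posterior_cov(X, prior_cov, noise_var):
    # Posterior covariance of theta for y = X @ theta + Gaussian noise;
    # in the linear-Gaussian model it depends on the design X only.
    precision = np.linalg.inv(prior_cov) + X.T @ X / noise_var
    return np.linalg.inv(precision)

def dataset_mutual_information(X_eval, X_test, prior_cov, noise_var=1.0):
    # I(D_eval; D_test) when the two datasets are conditionally independent
    # given theta:
    #   0.5 * (logdet S_prior - logdet S_eval - logdet S_test + logdet S_both)
    logdet = lambda S: np.linalg.slogdet(S)[1]
    s_eval = posterior_cov(X_eval, prior_cov, noise_var)
    s_test = posterior_cov(X_test, prior_cov, noise_var)
    s_both = posterior_cov(np.vstack([X_eval, X_test]), prior_cov, noise_var)
    return 0.5 * (logdet(prior_cov) - logdet(s_eval)
                  - logdet(s_test) + logdet(s_both))

# Toy usage with random "embeddings" of the curated data and the test data.
rng = np.random.default_rng(0)
X_eval, X_test = rng.normal(size=(200, 8)), rng.normal(size=(50, 8))
print(dataset_mutual_information(X_eval, X_test, prior_cov=np.eye(8)))
```

A higher score means the evaluated dataset carries more information about the test data through the shared model parameters, which is the spirit of the Blackwell-style comparison the paper advocates.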
Related papers
- Privacy-Preserved Automated Scoring using Federated Learning for Educational Research [1.2556373621040728]
This study proposes a federated learning framework for automatic scoring in educational assessments.
Student responses are processed locally on edge devices, and only optimized model parameters are shared with a central aggregation server.
We evaluate our framework using assessment data from nine middle schools, comparing the accuracy of federated learning-based scoring models with traditionally trained centralized models.
arXiv Detail & Related papers (2025-03-12T19:06:25Z) - Capturing the Temporal Dependence of Training Data Influence [100.91355498124527]
We formalize the concept of trajectory-specific leave-one-out influence, which quantifies the impact of removing a data point during training.
We propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO.
As data value embedding captures training data ordering, it offers valuable insights into model training dynamics.
arXiv Detail & Related papers (2024-12-12T18:28:55Z) - A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which degrades training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
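The paper's full pipeline is more involved; the following is only a generic sketch of alignment-based selection on precomputed paired embeddings (from CLIP or any vision-language encoder), where poorly aligned image-text pairs are treated as noise and dropped. The function name and fixed keep fraction are assumptions for illustration.

```python
import numpy as np

def select_by_alignment(img_emb, txt_emb, keep_frac=0.8):
    # img_emb, txt_emb: (n, d) arrays of paired image/text embeddings.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    alignment = (img * txt).sum(axis=1)        # cosine similarity per pair
    n_keep = int(len(alignment) * keep_frac)
    return np.argsort(-alignment)[:n_keep]     # indices of best-aligned samples
```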
arXiv Detail & Related papers (2024-10-15T03:00:58Z) - Data Distribution Valuation [56.71023681599737]
Existing data valuation methods define a value for a discrete dataset.
In many use cases, users are interested not only in the value of the dataset itself but also in the value of the distribution from which the dataset was sampled.
We propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies.
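For reference, the MMD quantity such a method builds on can be estimated in a few lines; this is a standard unbiased estimator with an RBF kernel, not the paper's full valuation policy.

```python
import numpy as np

def mmd2_unbiased(x, y, gamma=1.0):
    # Unbiased estimate of MMD^2 between samples x ~ P and y ~ Q;
    # smaller values mean the candidate distribution is closer to the reference.
    rbf = lambda a, b: np.exp(-gamma * ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
    kxx, kyy, kxy = rbf(x, x), rbf(y, y), rbf(x, y)
    n, m = len(x), len(y)
    np.fill_diagonal(kxx, 0.0)   # drop i == j terms for unbiasedness
    np.fill_diagonal(kyy, 0.0)
    return kxx.sum() / (n * (n - 1)) + kyy.sum() / (m * (m - 1)) - 2 * kxy.mean()
```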
arXiv Detail & Related papers (2024-10-06T07:56:53Z) - Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs [11.24476329991465]
Training large language models (LLMs) for external tool usage is a rapidly expanding field.
The absence of systematic data quality checks poses complications for properly training and testing models.
We propose two approaches for assessing the reliability of data for training LLMs to use external tools.
arXiv Detail & Related papers (2024-09-24T17:20:02Z) - Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
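One plausible instantiation of an option-content-based probe, sketched under the assumption that a memorized item makes the canonical option ordering an outlier among random permutations; `score_fn` is a hypothetical stand-in for a model's log-likelihood of a prompt, and this is not necessarily the paper's exact statistic.

```python
import random

def permutation_outlier_gap(question, options, score_fn, n_perm=20, seed=0):
    # Compare the model's score for the canonical option order against random
    # shufflings; a large positive gap hints the original ordering was seen
    # during training. score_fn(prompt) -> log-likelihood (hypothetical stub).
    rng = random.Random(seed)
    fmt = lambda opts: question + "\n" + "\n".join(
        f"{chr(65 + i)}. {o}" for i, o in enumerate(opts))
    canonical = score_fn(fmt(options))
    shuffled_scores = []
    for _ in range(n_perm):
        opts = options[:]
        rng.shuffle(opts)
        shuffled_scores.append(score_fn(fmt(opts)))
    return canonical - sum(shuffled_scores) / len(shuffled_scores)
```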
arXiv Detail & Related papers (2024-09-03T11:09:44Z) - Personalization of Dataset Retrieval Results using a Metadata-based Data Valuation Method [0.5999777817331317]
We propose a novel data valuation method for a dataset retrieval use case in Ireland's national mapping agency.
By leveraging metadata and a user's preferences, we estimate the personal value of each dataset.
We validated the data value-based ranking against the stakeholders' ranking of the datasets.
arXiv Detail & Related papers (2024-07-22T11:13:07Z) - Is Data Valuation Learnable and Interpretable? [3.9325957466009203]
Current data valuation methods ignore the interpretability of the output values.
This study aims to answer an important question: is data valuation learnable and interpretable?
arXiv Detail & Related papers (2024-06-03T08:13:47Z) - A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples.
Existing literature surveys focus only on a specific data modality.
We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z) - Neural Dynamic Data Valuation [4.286118155737111]
We propose a novel data valuation method from the perspective of optimal control, named neural dynamic data valuation (NDDV).
Our method has a solid theoretical interpretation, identifying data values accurately via the sensitivity of the data optimal control state.
In addition, we implement a data re-weighting strategy to capture the unique features of data points, ensuring fairness through the interaction between data points and the mean-field states.
arXiv Detail & Related papers (2024-04-30T13:39:26Z) - Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z) - TRIAGE: Characterizing and auditing training data for improved
regression [80.11415390605215]
We introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors.
TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score.
We show that TRIAGE's characterization is consistent and highlight its utility to improve performance via data sculpting/filtering, in multiple regression settings.
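TRIAGE's actual score is built from conformal predictive distributions; as a rough, simplified stand-in in the same model-agnostic spirit, one can rank each training point's residual against residuals from a held-out calibration split (the split size and base model here are assumptions).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def residual_rank_scores(X, y, model=None, seed=0):
    # Each training point's residual is placed within the empirical CDF of
    # calibration residuals; scores near 0 or 1 flag atypical points.
    model = model if model is not None else Ridge()
    X_fit, X_cal, y_fit, y_cal = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model.fit(X_fit, y_fit)
    cal_res = y_cal - model.predict(X_cal)   # calibration residuals
    res = y - model.predict(X)               # residuals for all points
    return np.array([(cal_res <= r).mean() for r in res])
```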
arXiv Detail & Related papers (2023-10-29T10:31:59Z) - On the Evaluation and Refinement of Vision-Language Instruction Tuning
Datasets [71.54954966652286]
We evaluate existing Vision-Language Instruction Tuning (VLIT) datasets.
We build a new dataset, REVO-LION, by collecting samples with higher sample quality (SQ) scores from each dataset.
Remarkably, even with only half of the complete data, the model trained on REVO-LION achieves performance comparable to simply combining all VLIT datasets.
arXiv Detail & Related papers (2023-10-10T13:01:38Z) - OpenDataVal: a Unified Benchmark for Data Valuation [38.15852021170501]
We introduce OpenDataVal, an easy-to-use and unified benchmark framework for data valuation.
OpenDataVal provides an integrated environment that includes eleven different state-of-the-art data valuation algorithms.
We perform benchmarking analysis using OpenDataVal, quantifying and comparing the efficacy of state-of-the-art data valuation approaches.
arXiv Detail & Related papers (2023-06-18T14:38:29Z) - Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value [17.340091573913316]
We propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate.
Data-OOB takes less than 2.25 hours on a single CPU when there are $10^6$ samples to evaluate and the input dimension is 100.
We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points.
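A compact sketch of the out-of-bag idea as described above (my reading, not the authors' reference implementation): bootstrap many weak learners, then value each point by the average agreement between its label and the predictions of the learners that did not train on it.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def data_oob_values(X, y, n_estimators=200, seed=0):
    # Low values flag likely mislabeled or harmful points.
    rng = np.random.default_rng(seed)
    n = len(X)
    agree, counts = np.zeros(n), np.zeros(n)
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)          # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)     # points this learner never saw
        tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
        agree[oob] += tree.predict(X[oob]) == y[oob]
        counts[oob] += 1
    return agree / np.maximum(counts, 1)
```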
arXiv Detail & Related papers (2023-04-16T08:03:58Z) - Statistical Dataset Evaluation: Reliability, Difficulty, and Validity [18.36931975072938]
We propose a model-agnostic dataset evaluation framework for automatic dataset quality evaluation.
We seek the statistical properties of the datasets and address three fundamental dimensions: reliability, difficulty, and validity.
arXiv Detail & Related papers (2022-12-19T06:55:42Z) - Investigating Data Variance in Evaluations of Automatic Machine Translation Metrics [58.50754318846996]
In this paper, we show that the performances of metrics are sensitive to data.
The ranking of metrics varies when the evaluation is conducted on different datasets.
arXiv Detail & Related papers (2022-03-29T18:58:28Z) - Generating Data to Mitigate Spurious Correlations in Natural Language Inference Datasets [27.562256973255728]
Natural language processing models often exploit spurious correlations between task-independent features and labels in datasets to perform well only within the distributions they are trained on.
We propose to tackle this problem by generating a debiased version of a dataset, which can then be used to train a debiased, off-the-shelf model.
Our approach consists of 1) a method for training data generators to generate high-quality, label-consistent data samples; and 2) a filtering mechanism for removing data points that contribute to spurious correlations.
arXiv Detail & Related papers (2022-03-24T09:08:05Z) - Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z) - Managing dataset shift by adversarial validation for credit scoring [5.560471251954645]
The inconsistency between the distribution of training data and the data that actually needs to be predicted is likely to cause poor model performance.
We propose a method based on adversarial validation to alleviate the dataset shift problem in credit scoring scenarios.
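Adversarial validation itself is a standard recipe and fits in a few lines; this minimal sketch (not the paper's full credit-scoring pipeline) trains a classifier to tell training rows from application rows, where cross-validated AUC near 0.5 indicates the two samples are indistinguishable.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def adversarial_validation_auc(X_train, X_apply, seed=0):
    # AUC ~ 0.5: no detectable shift; AUC -> 1.0: strong dataset shift.
    X = np.vstack([X_train, X_apply])
    y = np.r_[np.zeros(len(X_train)), np.ones(len(X_apply))]
    clf = GradientBoostingClassifier(random_state=seed)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```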
arXiv Detail & Related papers (2021-12-19T07:07:15Z) - How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the dataset, which we refer to as the $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)