Efficiently Verifiable Proofs of Data Attribution
- URL: http://arxiv.org/abs/2508.10866v2
- Date: Sun, 17 Aug 2025 22:05:23 GMT
- Title: Efficiently Verifiable Proofs of Data Attribution
- Authors: Ari Karchmer, Martin Pawelczyk, Seth Neel,
- Abstract summary: We propose an interactive verification paradigm for data attribution.<n>We provide formal completeness, soundness, and efficiency guarantees in the sense of Probably-Approximately-Correct (PAC) verification.
- Score: 9.05608916348947
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data attribution methods aim to answer useful counterfactual questions like "what would a ML model's prediction be if it were trained on a different dataset?" However, estimation of data attribution models through techniques like empirical influence or "datamodeling" remains very computationally expensive. This causes a critical trust issue: if only a few computationally rich parties can obtain data attributions, how can resource-constrained parties trust that the provided attributions are indeed "good," especially when they are used for important downstream applications (e.g., data pricing)? In this paper, we address this trust issue by proposing an interactive verification paradigm for data attribution. An untrusted and computationally powerful Prover learns data attributions, and then engages in an interactive proof with a resource-constrained Verifier. Our main result is a protocol that provides formal completeness, soundness, and efficiency guarantees in the sense of Probably-Approximately-Correct (PAC) verification. Specifically, if both Prover and Verifier follow the protocol, the Verifier accepts data attributions that are {\epsilon}-close to the optimal data attributions (in terms of the Mean Squared Error) with probability 1-{\delta}. Conversely, if the Prover arbitrarily deviates from the protocol, even with infinite compute, then this is detected (or it still yields data attributions to the Verifier) except with probability {\delta}. Importantly, our protocol ensures the Verifier's workload, measured by the number of independent model retrainings it must perform, scales only as O(1/{\epsilon}); i.e., independently of the dataset size. At a technical level, our results apply to efficiently verifying any linear function over the boolean hypercube computed by the Prover, making them broadly applicable to various attribution tasks.
Related papers
- Perfectly-Private Analog Secure Aggregation in Federated Learning [51.61616734974475]
In federated learning, multiple parties train models locally and share their parameters with a central server, which aggregates them to update a global model.<n>In this paper, a novel secure parameter aggregation method is proposed that employs the torus rather than a finite field.
arXiv Detail & Related papers (2025-09-10T15:22:40Z) - DUPRE: Data Utility Prediction for Efficient Data Valuation [49.60564885180563]
Cooperative game theory-based data valuation, such as Data Shapley, requires evaluating the data utility and retraining the ML model for multiple data subsets.<n>Our framework, textttDUPRE, takes an alternative yet complementary approach that reduces the cost per subset evaluation by predicting data utilities instead of evaluating them by model retraining.<n>Specifically, given the evaluated data utilities of some data subsets, textttDUPRE fits a emphGaussian process (GP) regression model to predict the utility of every other data subset.
arXiv Detail & Related papers (2025-02-22T08:53:39Z) - VeriFlow: Modeling Distributions for Neural Network Verification [3.510536859655114]
Formal verification has emerged as a promising method to ensure the safety and reliability of neural networks.<n>We propose the VeriFlow architecture as a flow-based density model tailored to allow any verification approach to restrict its search to some data distribution of interest.
arXiv Detail & Related papers (2024-06-20T12:41:39Z) - Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z) - Federated Causal Discovery from Heterogeneous Data [70.31070224690399]
We propose a novel FCD method attempting to accommodate arbitrary causal models and heterogeneous data.
These approaches involve constructing summary statistics as a proxy of the raw data to protect data privacy.
We conduct extensive experiments on synthetic and real datasets to show the efficacy of our method.
arXiv Detail & Related papers (2024-02-20T18:53:53Z) - Gaussian Process Probes (GPP) for Uncertainty-Aware Probing [61.91898698128994]
We introduce a unified and simple framework for probing and measuring uncertainty about concepts represented by models.
Our experiments show it can (1) probe a model's representations of concepts even with a very small number of examples, (2) accurately measure both epistemic uncertainty (how confident the probe is) and aleatory uncertainty (how fuzzy the concepts are to the model), and (3) detect out of distribution data using those uncertainty measures as well as classic methods do.
arXiv Detail & Related papers (2023-05-29T17:00:16Z) - Conditional Feature Importance for Mixed Data [1.6114012813668934]
We develop a conditional predictive impact (CPI) framework with knockoff sampling.
We show that our proposed workflow controls type I error, achieves high power and is in line with results given by other conditional FI measures.
Our findings highlight the necessity of developing statistically adequate, specialized methods for mixed data.
arXiv Detail & Related papers (2022-10-06T16:52:38Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples.
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - Automatically detecting data drift in machine learning classifiers [2.202253618096515]
We term changes that affect machine learning performance data drift' or drift'
We propose an approach based solely on classifier suggested labels and its confidence in them, for alerting on data distribution or feature space changes that are likely to cause data drift.
arXiv Detail & Related papers (2021-11-10T12:34:14Z) - Fair Densities via Boosting the Sufficient Statistics of Exponential
Families [72.34223801798422]
We introduce a boosting algorithm to pre-process data for fairness.
Our approach shifts towards better data fitting while still ensuring a minimal fairness guarantee.
Empirical results are present to display the quality of result on real-world data.
arXiv Detail & Related papers (2020-12-01T00:49:17Z) - Privacy-Preserving Gaussian Process Regression -- A Modular Approach to
the Application of Homomorphic Encryption [4.1499725848998965]
Homomorphic encryption (FHE) allows data to be computed on whilst encrypted.
Some commonly used machine learning algorithms, such as Gaussian process regression, are poorly suited to FHE.
We show that a modular approach, which applies FHE to only the sensitive steps of a workflow that need protection, allows one party to make predictions on their data.
arXiv Detail & Related papers (2020-01-28T11:50:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.