Related papers: A General Framework for Data-Use Auditing of ML Models

URL: http://arxiv.org/abs/2407.15100v2
Date: Sun, 4 Aug 2024 05:55:40 GMT
Title: A General Framework for Data-Use Auditing of ML Models
Authors: Zonghao Huang, Neil Zhenqiang Gong, Michael K. Reiter
Abstract summary: We propose a general method to audit an ML model for the use of a data-owner's data in training. We show the effectiveness of our proposed framework by applying it to audit data use in two types of ML models.
Score: 47.369572284751285
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Auditing the use of data in training machine-learning (ML) models is an increasingly pressing challenge, as myriad ML practitioners routinely leverage the effort of content creators to train models without their permission. In this paper, we propose a general method to audit an ML model for the use of a data-owner's data in training, without prior knowledge of the ML task for which the data might be used. Our method leverages any existing black-box membership inference method, together with a sequential hypothesis test of our own design, to detect data use with a quantifiable, tunable false-detection rate. We show the effectiveness of our proposed framework by applying it to audit data use in two types of ML models, namely image classifiers and foundation models.
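As a rough illustration of how a sequential test can stop early while keeping the false-detection rate tunable, the sketch below runs Wald's classical sequential probability ratio test (SPRT) on a stream of 0/1 outcomes. This is not the authors' test, which the abstract describes as being of their own design. It rests on several assumptions that do not come from the abstract: each outcome is 1 when a black-box membership-inference score ranks one of the data owner's items above a comparable held-out item, this happens with probability 0.5 when the data was not used, and the function name `sprt_bernoulli` and the parameters `p1`, `alpha`, and `beta` are hypothetical.

```python
import math


def sprt_bernoulli(outcomes, p0=0.5, p1=0.7, alpha=0.05, beta=0.05):
    """Wald's SPRT on a stream of 0/1 outcomes (illustrative sketch only).

    H0: each outcome is 1 with probability p0 (assumed: data not used).
    H1: each outcome is 1 with probability p1 (assumed: data used).
    alpha is the target false-detection rate, beta the target miss rate.
    Returns (decision, number_of_outcomes_consumed).
    """
    upper = math.log((1 - beta) / alpha)  # crossing it rejects H0: "detect"
    lower = math.log(beta / (1 - alpha))  # crossing it accepts H0: "no-detect"
    llr = 0.0  # running log-likelihood ratio of H1 against H0
    n = 0
    for n, x in enumerate(outcomes, start=1):
        if x:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "detect", n
        if llr <= lower:
            return "no-detect", n
    # The stream ended before either threshold was crossed.
    return "undecided", n


if __name__ == "__main__":
    print(sprt_bernoulli([1] * 20))
    print(sprt_bernoulli([0] * 20))
```

Lowering `alpha` raises the upper threshold, so more evidence is needed before data use is declared. With Wald's thresholds the realized error rates only approximately match `alpha` and `beta`.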

Related papers

Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the large language model (LLM) era is the generalization issue. We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
Instance-Level Data-Use Auditing of Visual ML Models [47.369572284751285]
The growing trend of legal disputes over the unauthorized use of data in machine learning (ML) systems highlights the need for reliable data-use auditing mechanisms. We present the first proactive instance-level data-use auditing method designed to enable data owners to audit the use of their individual data instances in ML models.
arXiv Detail & Related papers (2025-03-28T13:28:57Z)
Self-Comparison for Dataset-Level Membership Inference in Large (Vision-)Language Models [73.94175015918059]
We propose a dataset-level membership inference method based on Self-Comparison. Our method does not require access to ground-truth member data or to non-member data from the same distribution.
arXiv Detail & Related papers (2024-10-16T23:05:59Z)
Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress. LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset. Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options. Our method is able to work under black-box conditions without access to model training data or weights. We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
Harnessing Large Language Models as Post-hoc Correctors [6.288056740658763]
We show that an LLM can work as a post-hoc corrector to propose corrections for the predictions of an arbitrary Machine Learning model. We form a contextual knowledge database by incorporating the dataset's label information and the ML model's predictions on the validation dataset. Our experimental results on text analysis and the challenging molecular predictions show that the LLM corrector improves the performance of a number of models by up to 39%.
arXiv Detail & Related papers (2024-02-20T22:50:41Z)
Learn to Unlearn: A Survey on Machine Unlearning [29.077334665555316]
This article presents a review of recent machine unlearning techniques, verification mechanisms, and potential attacks. We highlight emerging challenges and prospective research directions. We aim for this paper to provide valuable resources for integrating privacy, equity, and resilience into ML systems.
arXiv Detail & Related papers (2023-05-12T14:28:02Z)
AI Model Disgorgement: Methods and Choices [127.54319351058167]
We introduce a taxonomy of possible disgorgement methods that are applicable to modern machine learning systems. We investigate the meaning of "removing the effects" of data in the trained model in a way that does not require retraining from scratch.
arXiv Detail & Related papers (2023-04-07T08:50:18Z)
Deep Learning model integrity checking mechanism using watermarking technique [0.0]
We propose a model integrity-checking mechanism that uses model watermarking techniques to monitor the integrity of ML models. Our proposed technique can monitor the integrity of ML models even when the model is further trained on newer data with a low computational cost.
arXiv Detail & Related papers (2023-01-29T03:05:53Z)
QuantifyML: How Good is my Machine Learning Model? [0.0]
QuantifyML aims to quantify the extent to which machine learning models have learned and generalized from the given data. The formula is analyzed with off-the-shelf model counters to obtain precise counts with respect to different model behavior.
arXiv Detail & Related papers (2021-10-25T01:56:01Z)
Supervised Machine Learning with Plausible Deniability [1.685485565763117]
We study the question of how well machine learning (ML) models trained on a certain data set provide privacy for the training data. We show that one can take a set of purely random training data, and from this define a suitable "learning rule" that will produce an ML model that is exactly $f$.
arXiv Detail & Related papers (2021-06-08T11:54:51Z)
ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data. The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
Transfer Learning without Knowing: Reprogramming Black-box Machine Learning Models with Scarce Data and Limited Resources [78.72922528736011]
We propose a novel approach, black-box adversarial reprogramming (BAR), that repurposes a well-trained black-box machine learning model. Using zeroth order optimization and multi-label mapping techniques, BAR can reprogram a black-box ML model solely based on its input-output responses. BAR outperforms state-of-the-art methods and yields comparable performance to the vanilla adversarial reprogramming method.
arXiv Detail & Related papers (2020-07-17T01:52:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.