Related papers: IPProtect: protecting the intellectual property of visual datasets during data valuation

URL: http://arxiv.org/abs/2212.11468v1
Date: Thu, 22 Dec 2022 03:36:19 GMT
Title: IPProtect: protecting the intellectual property of visual datasets during data valuation
Authors: Gursimran Singh, Chendi Wang, Ahnaf Tazwar, Lanjun Wang, Yong Zhang
Abstract summary: We tackle the novel task of preemptively protecting the IP of datasets that need to be shared during data valuation. First, we identify and formalize two kinds of novel IP risks in visual datasets: data-item (image) IP and statistical (dataset) IP.
Score: 8.092563412918128
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Data trading is essential to accelerate the development of data-driven machine learning pipelines. The central problem in data trading is to estimate the utility of a seller's dataset with respect to a given buyer's machine learning task, also known as data valuation. Typically, data valuation requires one or more participants to share their raw dataset with others, leading to potential risks of intellectual property (IP) violations. In this paper, we tackle the novel task of preemptively protecting the IP of datasets that need to be shared during data valuation. First, we identify and formalize two kinds of novel IP risks in visual datasets: data-item (image) IP and statistical (dataset) IP. Then, we propose a novel algorithm to convert the raw dataset into a sanitized version that provides resistance to IP violations while at the same time allowing accurate data valuation. The key idea is to limit the transfer of information from the raw dataset to the sanitized dataset, thereby protecting against potential intellectual property violations. Next, we analyze our method for the likely existence of a solution and immunity against reconstruction attacks. Finally, we conduct extensive experiments on three computer vision datasets demonstrating the advantages of our method in comparison to other baselines.

Related papers

Sell Data to AI Algorithms Without Revealing It: Secure Data Valuation and Sharing via Homomorphic Encryption [10.12846924939717]
We introduce the Trustworthy Influence Protocol (TIP), a privacy-preserving framework that enables buyers to quantify the utility of external data without decrypting the raw assets. By integrating Homomorphic Encryption with gradient-based influence functions, our approach allows for the precise, blinded scoring of data points against a buyer's specific AI model. Empirical simulations in healthcare and generative AI domains validate the framework's economic potential.
arXiv Detail & Related papers (2025-12-04T16:35:09Z)
DATABench: Evaluating Dataset Auditing in Deep Learning from an Adversarial Perspective [59.66984417026933]
We introduce a novel taxonomy, classifying existing methods based on their reliance on internal features (IF) (inherent to the data) versus external features (EF) (artificially introduced for auditing). We formulate two primary attack types: evasion attacks, designed to conceal the use of a dataset, and forgery attacks, intending to falsely implicate an unused dataset. Building on the understanding of existing methods and attack objectives, we further propose systematic attack strategies: decoupling, removal, and detection for evasion; adversarial example-based methods for forgery. Our benchmark, DATABench, comprises 17 evasion attacks, 5 forgery attacks, and 9
arXiv Detail & Related papers (2025-07-08T03:07:15Z)
Dataset Protection via Watermarked Canaries in Retrieval-Augmented LLMs [67.0310240737424]
We introduce a novel approach to safeguard the ownership of text datasets and effectively detect unauthorized use by the RA-LLMs. Our approach preserves the original data completely unchanged while protecting it by inserting specifically designed canary documents into the IP dataset. During the detection process, unauthorized usage is identified by querying the canary documents and analyzing the responses of RA-LLMs.
arXiv Detail & Related papers (2025-02-15T04:56:45Z)
Privacy Preservation through Practical Machine Unlearning [0.0]
This paper examines methods such as Naive Retraining and Exact Unlearning via the SISA framework. We explore the potential of integrating unlearning principles into Positive Unlabeled (PU) Learning to address challenges posed by partially labeled datasets.
arXiv Detail & Related papers (2025-02-15T02:25:27Z)
Privacy-Preserving Dataset Combination [1.0485433579460999]
We introduce SecureKL, a protocol for dataset-to-dataset evaluations with zero privacy leakage. SecureKL evaluates a source dataset against candidates, performing dataset divergence metrics internally with private computations. On real-world data, SecureKL achieves high consistency ($>90\%$ correlation with non-private counterparts) and successfully identifies beneficial data collaborations.
arXiv Detail & Related papers (2025-02-09T03:54:17Z)
Private, Augmentation-Robust and Task-Agnostic Data Valuation Approach for Data Marketplace [56.78396861508909]
PriArTa is an approach for computing the distance between the distribution of the buyer's existing dataset and the seller's dataset. PriArTa is communication-efficient, enabling the buyer to evaluate datasets without needing access to the entire dataset from each seller.
arXiv Detail & Related papers (2024-11-01T17:13:14Z)
Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy. Existing techniques face the emerging challenges of re-identification attack ability of Large Language Models. This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources. RAG systems may face severe privacy risks when retrieving private data. We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z)
Truthful Dataset Valuation by Pointwise Mutual Information [28.63827288801458]
We propose a new data valuation method that provably guarantees the following: data providers always maximize their expected score by truthfully reporting their observed data. Our method, following the paradigm of proper scoring rules, measures the pointwise mutual information (PMI) of the test dataset and the evaluated dataset.
arXiv Detail & Related papers (2024-05-28T15:04:17Z)
Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets. We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers. Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies. Machine and deep learning algorithms depend heavily on the data used during their development. We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
Secure Multiparty Computation for Synthetic Data Generation from Distributed Data [7.370727048591523]
Legal and ethical restrictions on accessing relevant data inhibit data science research in critical domains such as health, finance, and education. Existing approaches assume that the data holders supply their raw data to a trusted curator, who uses it as fuel for synthetic data generation. We propose the first solution in which data holders only share encrypted data for differentially private synthetic data generation.
arXiv Detail & Related papers (2022-10-13T20:09:17Z)
Predicting Seriousness of Injury in a Traffic Accident: A New Imbalanced Dataset and Benchmark [62.997667081978825]
The paper introduces a new dataset to assess the performance of machine learning algorithms in the prediction of the seriousness of injury in a traffic accident. The dataset is created by aggregating publicly available datasets from the UK Department for Transport.
arXiv Detail & Related papers (2022-05-20T21:15:26Z)
PicoDomain: A Compact High-Fidelity Cybersecurity Dataset [0.9281671380673305]
Current cybersecurity datasets either offer no ground truth or do so with anonymized data. Most existing datasets are large enough to make them unwieldy during prototype development. In this paper we have developed the PicoDomain dataset, a compact high-fidelity collection of Zeek logs from a realistic intrusion.
arXiv Detail & Related papers (2020-08-20T20:18:04Z)
A Critical Overview of Privacy-Preserving Approaches for Collaborative Forecasting [0.0]
Cooperation between different data owners may lead to an improvement in forecast quality. Due to business competitive factors and personal data protection questions, said data owners might be unwilling to share their data. This paper analyses the state-of-the-art and unveils several shortcomings of existing methods in guaranteeing data privacy.
arXiv Detail & Related papers (2020-04-20T20:21:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.