IPProtect: protecting the intellectual property of visual datasets
during data valuation
- URL: http://arxiv.org/abs/2212.11468v1
- Date: Thu, 22 Dec 2022 03:36:19 GMT
- Title: IPProtect: protecting the intellectual property of visual datasets
during data valuation
- Authors: Gursimran Singh, Chendi Wang, Ahnaf Tazwar, Lanjun Wang, Yong Zhang
- Abstract summary: We tackle the novel task of preemptively protecting the IP of datasets that need to be shared during data valuation.
First, we identify and formalize two kinds of novel IP risks in visual datasets: data-item (image) IP and statistical (dataset) IP.
- Score: 8.092563412918128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data trading is essential to accelerate the development of data-driven
machine learning pipelines. The central problem in data trading is to estimate
the utility of a seller's dataset with respect to a given buyer's machine
learning task, also known as data valuation. Typically, data valuation requires
one or more participants to share their raw dataset with others, leading to
potential risks of intellectual property (IP) violations. In this paper, we
tackle the novel task of preemptively protecting the IP of datasets that need
to be shared during data valuation. First, we identify and formalize two kinds
of novel IP risks in visual datasets: data-item (image) IP and statistical
(dataset) IP. Then, we propose a novel algorithm to convert the raw dataset
into a sanitized version that provides resistance to IP violations while at
the same time allowing accurate data valuation. The key idea is to limit the
transfer of information from the raw dataset to the sanitized dataset, thereby
protecting against potential intellectual property violations. Next, we analyze
our method for the likely existence of a solution and immunity against
reconstruction attacks. Finally, we conduct extensive experiments on three
computer vision datasets demonstrating the advantages of our method in
comparison to other baselines.
Related papers
- Private, Augmentation-Robust and Task-Agnostic Data Valuation Approach for Data Marketplace [56.78396861508909]
PriArTa is an approach for computing the distance between the distribution of the buyer's existing dataset and the seller's dataset.
PriArTa is communication-efficient, enabling the buyer to evaluate datasets without needing access to the entire dataset from each seller.
arXiv Detail & Related papers (2024-11-01T17:13:14Z)
- Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy.
Existing techniques face emerging challenges from the re-identification attack capabilities of Large Language Models.
This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
- Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.
RAG systems may face severe privacy risks when retrieving private data.
We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z)
- Truthful Dataset Valuation by Pointwise Mutual Information [28.63827288801458]
We propose a new data valuation method that provably guarantees the following: data providers always maximize their expected score by truthfully reporting their observed data.
Our method, following the paradigm of proper scoring rules, measures the pointwise mutual information (PMI) of the test dataset and the evaluated dataset.
arXiv Detail & Related papers (2024-05-28T15:04:17Z)
- Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets.
We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers.
Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Secure Multiparty Computation for Synthetic Data Generation from Distributed Data [7.370727048591523]
Legal and ethical restrictions on accessing relevant data inhibit data science research in critical domains such as health, finance, and education.
Existing approaches assume that the data holders supply their raw data to a trusted curator, who uses it as fuel for synthetic data generation.
We propose the first solution in which data holders only share encrypted data for differentially private synthetic data generation.
arXiv Detail & Related papers (2022-10-13T20:09:17Z)
- Predicting Seriousness of Injury in a Traffic Accident: A New Imbalanced Dataset and Benchmark [62.997667081978825]
The paper introduces a new dataset to assess the performance of machine learning algorithms in the prediction of the seriousness of injury in a traffic accident.
The dataset is created by aggregating publicly available datasets from the UK Department for Transport.
arXiv Detail & Related papers (2022-05-20T21:15:26Z)
- PicoDomain: A Compact High-Fidelity Cybersecurity Dataset [0.9281671380673305]
Current cybersecurity datasets either offer no ground truth or do so with anonymized data.
Most existing datasets are large enough to make them unwieldy during prototype development.
In this paper we have developed the PicoDomain dataset, a compact high-fidelity collection of Zeek logs from a realistic intrusion.
arXiv Detail & Related papers (2020-08-20T20:18:04Z)
- A Critical Overview of Privacy-Preserving Approaches for Collaborative Forecasting [0.0]
Cooperation between different data owners may lead to an improvement in forecast quality.
Due to business competitive factors and personal data protection questions, said data owners might be unwilling to share their data.
This paper analyses the state-of-the-art and unveils several shortcomings of existing methods in guaranteeing data privacy.
arXiv Detail & Related papers (2020-04-20T20:21:04Z)