DataCook: Crafting Anti-Adversarial Examples for Healthcare Data Copyright Protection
- URL: http://arxiv.org/abs/2403.17755v1
- Date: Tue, 26 Mar 2024 14:44:51 GMT
- Title: DataCook: Crafting Anti-Adversarial Examples for Healthcare Data Copyright Protection
- Authors: Sihan Shang, Jiancheng Yang, Zhenglong Sun, Pascal Fua
- Abstract summary: DataCook operates by "cooking" the raw data before distribution, enabling the development of models that perform normally on this processed data.
During the deployment phase, the original test data must also be "cooked" through DataCook to ensure normal model performance.
The mechanism behind DataCook is the crafting of anti-adversarial examples (AntiAdv), which are designed to enhance model confidence.
- Score: 47.91906879320081
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the realm of healthcare, the challenges of copyright protection and unauthorized third-party misuse are increasingly significant. Traditional methods for data copyright protection are applied prior to data distribution, implying that models trained on these data become uncontrollable. This paper introduces a novel approach, named DataCook, designed to safeguard the copyright of healthcare data during the deployment phase. DataCook operates by "cooking" the raw data before distribution, enabling the development of models that perform normally on this processed data. However, during the deployment phase, the original test data must also be "cooked" through DataCook to ensure normal model performance. This process grants copyright holders control over authorization during the deployment phase. The mechanism behind DataCook is the crafting of anti-adversarial examples (AntiAdv), which are designed to enhance model confidence, as opposed to standard adversarial examples (Adv) that aim to confuse models. Similar to Adv, AntiAdv introduces imperceptible perturbations, ensuring that the data processed by DataCook remains easily understandable. We conducted extensive experiments on MedMNIST datasets, encompassing both 2D/3D data and the high-resolution variants. The outcomes indicate that DataCook effectively meets its objectives, preventing models trained on AntiAdv from analyzing unauthorized data effectively, without compromising the validity and accuracy of the data in legitimate scenarios. Code and data are available at https://github.com/MedMNIST/DataCook.
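The crafting step can be pictured as a reversed PGD attack: instead of ascending the loss to push the model toward a wrong prediction, the perturbation descends the loss within an imperceptible budget so that confidence on the true label rises. The following is a minimal PyTorch sketch under that assumption; the function name `cook`, the L-inf budget, step size, and iteration count are illustrative and not the paper's actual settings.

```python
# Minimal sketch of "cooking" data via anti-adversarial perturbations.
# Assumptions (not from the paper): a PyTorch classifier `model`, inputs in
# [0, 1], and a PGD-style loop with an L-inf budget; the paper's exact
# crafting procedure may differ.
import torch
import torch.nn.functional as F

def cook(model, x, y, eps=4/255, alpha=1/255, steps=10):
    """Craft an anti-adversarial example: perturb x within an L-inf ball
    so the model becomes *more* confident on the true label y."""
    x_cooked = x.clone().detach()
    for _ in range(steps):
        x_cooked.requires_grad_(True)
        loss = F.cross_entropy(model(x_cooked), y)
        grad = torch.autograd.grad(loss, x_cooked)[0]
        # Descend the loss (opposite sign of FGSM/PGD) to raise confidence.
        x_cooked = x_cooked.detach() - alpha * grad.sign()
        # Keep the perturbation imperceptible and the pixels valid.
        x_cooked = x + torch.clamp(x_cooked - x, -eps, eps)
        x_cooked = torch.clamp(x_cooked, 0.0, 1.0)
    return x_cooked.detach()
```

In this reading, authorized parties train and evaluate on cooked data, while raw, uncooked test data falls outside the distribution the model has adapted to, which is what gives the copyright holder deployment-time control.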
Related papers
- Data Taggants: Dataset Ownership Verification via Harmless Targeted Data Poisoning [12.80649024603656]
This paper introduces data taggants, a novel non-backdoor dataset ownership verification technique.
We validate our approach through comprehensive and realistic experiments on ImageNet1k using ViT and ResNet models with state-of-the-art training recipes.
arXiv Detail & Related papers (2024-10-09T12:49:23Z)
- CAP: Detecting Unauthorized Data Usage in Generative Models via Prompt Generation [1.6141139250981018]
Copyright Audit via Prompts generation (CAP) is a framework for automatically testing whether an ML model has been trained with unauthorized data.
Specifically, we devise an approach to generate suitable keys inducing the model to reveal copyrighted contents.
To prove its effectiveness, we conducted an extensive evaluation campaign on measurements collected in four IoT scenarios.
arXiv Detail & Related papers (2024-10-08T08:49:41Z)
- Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks [70.39633252935445]
Data contamination has become prevalent and challenging with the rise of models pretrained on large automatically-crawled corpora.
For closed models, the training data becomes a trade secret, and even for open models, it is not trivial to detect contamination.
We propose three strategies that can make a difference: (1) Test data made public should be encrypted with a public key and licensed to disallow derivative distribution; (2) demand training exclusion controls from closed API holders, and protect your test data by refusing to evaluate without them; and (3) avoid data which appears with its solution on the internet, and release the web-page context of internet-derived data.
arXiv Detail & Related papers (2023-05-17T12:23:38Z)
- The Devil's Advocate: Shattering the Illusion of Unexploitable Data using Diffusion Models [14.018862290487617]
We show that a carefully designed denoising process can counteract the data-protecting perturbations.
Our approach, called AVATAR, delivers state-of-the-art performance against a suite of recent availability attacks.
arXiv Detail & Related papers (2023-03-15T10:20:49Z)
- Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
- Black-box Dataset Ownership Verification via Backdoor Watermarking [67.69308278379957]
We formulate the protection of released datasets as verifying whether they are adopted for training a (suspicious) third-party model.
We propose to embed external patterns via backdoor watermarking for ownership verification to protect them.
Specifically, we exploit poison-only backdoor attacks (e.g., BadNets) for dataset watermarking and design a hypothesis-test-guided method for dataset verification.
arXiv Detail & Related papers (2022-08-04T05:32:20Z)
- Distill and Fine-tune: Effective Adaptation from a Black-box Source Model [138.12678159620248]
Unsupervised domain adaptation (UDA) aims to transfer knowledge from previously related labeled datasets (source) to a new unlabeled dataset (target).
We propose a novel two-step adaptation framework called Distill and Fine-tune (Dis-tune).
arXiv Detail & Related papers (2021-04-04T05:29:05Z)
- Self-Supervised Noisy Label Learning for Source-Free Unsupervised Domain Adaptation [87.60688582088194]
We propose a novel Self-Supervised Noisy Label Learning method.
Our method can easily achieve state-of-the-art results and surpass other methods by a very large margin.
arXiv Detail & Related papers (2021-02-23T10:51:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.