Radioactive data: tracing through training
- URL: http://arxiv.org/abs/2002.00937v1
- Date: Mon, 3 Feb 2020 18:41:08 GMT
- Title: Radioactive data: tracing through training
- Authors: Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou
- Abstract summary: We propose a new technique, radioactive data, that makes imperceptible changes to this dataset such that any model trained on it will bear an identifiable mark.
Given a trained model, our technique detects the use of radioactive data and provides a level of confidence (p-value).
Our method is robust to data augmentation and the stochasticity of deep network optimization.
- Score: 130.2266320167683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We want to detect whether a particular image dataset has been used to train a
model. We propose a new technique, \emph{radioactive data}, that makes
imperceptible changes to this dataset such that any model trained on it will
bear an identifiable mark. The mark is robust to strong variations such as
different architectures or optimization methods. Given a trained model, our
technique detects the use of radioactive data and provides a level of
confidence (p-value). Our experiments on large-scale benchmarks (Imagenet),
using standard architectures (Resnet-18, VGG-16, Densenet-121) and training
procedures, show that we can detect usage of radioactive data with high
confidence (p<10^-4) even when only 1% of the data used to train our model is
radioactive. Our method is robust to data augmentation and the stochasticity of
deep network optimization. As a result, it offers a much higher signal-to-noise
ratio than data poisoning and backdoor methods.
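The abstract describes detection as a statistical alignment test: if a model was trained on marked data, its classifier weights align abnormally with the carrier direction used for marking, and the cosine of a random unit vector with a fixed direction in dimension d has a known null distribution (its square is Beta(1/2, (d-1)/2)). A minimal sketch of such a test, with illustrative function names and a known carrier direction assumed (the paper's exact statistic and alignment procedure may differ):

```python
import numpy as np
from scipy.stats import beta


def cosine_pvalue(c, d):
    """One-sided p-value P(cos >= c) for the cosine between a random
    unit vector and a fixed direction in R^d.

    Under the null hypothesis (model not trained on radioactive data),
    the classifier weights are independent of the carrier, so their
    cosine behaves like that of a random direction: cos^2 follows
    Beta(1/2, (d-1)/2), split evenly between positive and negative cos.
    """
    if c < 0:
        # By symmetry: P(cos >= c) = 1 - P(cos >= -c)
        return 1.0 - cosine_pvalue(-c, d)
    if c == 0:
        return 0.5
    return 0.5 * beta.sf(c * c, 0.5, (d - 1) / 2)


def detect_mark(carrier, class_weights):
    """Test whether a class's weight vector is abnormally aligned
    with the carrier direction. Returns (cosine, p-value); a tiny
    p-value indicates the radioactive mark is present."""
    u = carrier / np.linalg.norm(carrier)
    w = class_weights / np.linalg.norm(class_weights)
    c = float(np.dot(u, w))
    return c, cosine_pvalue(c, carrier.shape[0])
```

In high dimension (e.g. d = 512 features of a Resnet-18), random directions are nearly orthogonal, so even a modest cosine such as 0.3 yields a p-value far below the 10^-4 threshold quoted above.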
Related papers
- Learning from Convolution-based Unlearnable Datasets [5.332412565926725]
The Convolution-based Unlearnable DAtaset (CUDA) method aims to make data unlearnable by applying class-wise blurs to every image in the dataset.
In this work, we evaluate whether data remains unlearnable after image sharpening and frequency filtering.
We observe a substantial increase in test accuracy over adversarial training for models trained with unlearnable data.
arXiv Detail & Related papers (2024-11-04T01:51:50Z) - Improved detection of discarded fish species through BoxAL active learning [0.2544632696242629]
In this study, we present an active learning technique, named BoxAL, which includes estimation of epistemic certainty of the Faster R-CNN object-detection model.
The method allows selecting the most uncertain training images from an unlabeled pool, which are then used to train the object-detection model.
Our study additionally showed that the sampled new data is more valuable for training than the remaining unlabeled data.
arXiv Detail & Related papers (2024-10-07T10:01:30Z) - Data-Independent Operator: A Training-Free Artifact Representation Extractor for Generalizable Deepfake Detection [105.9932053078449]
In this work, we show that, on the contrary, a small and training-free filter is sufficient to capture more general artifact representations.
Because it is unbiased towards both the training and test sources, we define it as the Data-Independent Operator (DIO) to achieve appealing improvements on unseen sources.
Our detector achieves a remarkable improvement of 13.3%, establishing a new state-of-the-art performance.
arXiv Detail & Related papers (2024-03-11T15:22:28Z) - DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection [41.436817746749384]
Diffusion Model is a scalable data engine for object detection.
DiffusionEngine (DE) provides high-quality detection-oriented training pairs in a single stage.
arXiv Detail & Related papers (2023-09-07T17:55:01Z) - Exploring Data Redundancy in Real-world Image Classification through Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
arXiv Detail & Related papers (2023-06-25T03:31:05Z) - Anomaly Detection with Ensemble of Encoder and Decoder [2.8199078343161266]
Anomaly detection in power grids aims to detect and discriminate anomalies caused by cyber attacks against the power system.
We propose a novel anomaly detection method by modeling the data distribution of normal samples via multiple encoders and decoders.
Experiment results on network intrusion and power system datasets demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-03-11T15:49:29Z) - Decision Forest Based EMG Signal Classification with Low Volume Dataset Augmented with Random Variance Gaussian Noise [51.76329821186873]
We produce a model that can classify six different hand gestures from a limited number of samples and that generalizes well to a wider audience.
We rely on more elementary methods, such as random bounds on a signal, and aim to show the power these methods can carry in an online setting.
arXiv Detail & Related papers (2022-06-29T23:22:18Z) - Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z) - ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To reduce the cost of training on the enlarged dataset, we propose to apply a dataset distillation strategy that compresses the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.