Radioactive data: tracing through training
- URL: http://arxiv.org/abs/2002.00937v1
- Date: Mon, 3 Feb 2020 18:41:08 GMT
- Title: Radioactive data: tracing through training
- Authors: Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou
- Abstract summary: We propose a new technique, \emph{radioactive data}, that makes imperceptible changes to this dataset such that any model trained on it will bear an identifiable mark.
Given a trained model, our technique detects the use of radioactive data and provides a level of confidence (p-value).
Our method is robust to data augmentation and the stochasticity of deep network optimization.
- Score: 130.2266320167683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We want to detect whether a particular image dataset has been used to train a
model. We propose a new technique, \emph{radioactive data}, that makes
imperceptible changes to this dataset such that any model trained on it will
bear an identifiable mark. The mark is robust to strong variations such as
different architectures or optimization methods. Given a trained model, our
technique detects the use of radioactive data and provides a level of
confidence (p-value). Our experiments on large-scale benchmarks (Imagenet),
using standard architectures (Resnet-18, VGG-16, Densenet-121) and training
procedures, show that we can detect usage of radioactive data with high
confidence (p<10^-4) even when only 1% of the data used to train our model is
radioactive. Our method is robust to data augmentation and the stochasticity of
deep network optimization. As a result, it offers a much higher signal-to-noise
ratio than data poisoning and backdoor methods.
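The abstract describes detection as a hypothesis test that yields a p-value. As a minimal sketch (not the authors' implementation), assume detection reduces to measuring the alignment between a model's classifier weight vector `w` and a planted carrier direction `u`: under the null hypothesis (the model never saw radioactive data), `w` carries no information about `u`, so their cosine similarity should look like that of a random direction. A Monte Carlo version of that test:

```python
import numpy as np

def detection_p_value(w, u, n_null=10_000, seed=0):
    """Monte Carlo p-value for alignment between classifier weights w
    and a planted carrier direction u. Under the null (the model never
    trained on radioactive data), cos(w, u) should be indistinguishable
    from the cosine between u and a random direction."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)
    u = np.asarray(u, dtype=float)
    cos = w @ u / (np.linalg.norm(w) * np.linalg.norm(u))
    # Null distribution: cosines between u and random Gaussian directions,
    # which are uniformly distributed on the sphere after normalization.
    r = rng.standard_normal((n_null, w.size))
    null_cos = r @ u / (np.linalg.norm(r, axis=1) * np.linalg.norm(u))
    # One-sided test: how often does chance alignment exceed the observed one?
    return (np.sum(null_cos >= cos) + 1) / (n_null + 1)
```

In high dimension the null cosine concentrates sharply around zero, which is why even a weak planted alignment can be detected with very small p-values; the paper reports p<10^-4 with only 1% radioactive data.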
Related papers
- Data-Independent Operator: A Training-Free Artifact Representation
Extractor for Generalizable Deepfake Detection [105.9932053078449]
In this work, we show that, on the contrary, the small and training-free filter is sufficient to capture more general artifact representations.
Because it is unbiased towards both the training and test sources, we define it as a Data-Independent Operator (DIO) to achieve appealing improvements on unseen sources.
Our detector achieves a remarkable improvement of 13.3%, establishing a new state-of-the-art performance.
arXiv Detail & Related papers (2024-03-11T15:22:28Z) - DiffusionEngine: Diffusion Model is Scalable Data Engine for Object
Detection [41.436817746749384]
Diffusion Model is a scalable data engine for object detection.
DiffusionEngine (DE) provides high-quality detection-oriented training pairs in a single stage.
arXiv Detail & Related papers (2023-09-07T17:55:01Z) - Exploring Data Redundancy in Real-world Image Classification through
Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
arXiv Detail & Related papers (2023-06-25T03:31:05Z) - Anomaly Detection with Ensemble of Encoder and Decoder [2.8199078343161266]
Anomaly detection in power grids aims to detect and discriminate anomalies caused by cyber attacks against the power system.
We propose a novel anomaly detection method by modeling the data distribution of normal samples via multiple encoders and decoders.
Experiment results on network intrusion and power system datasets demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-03-11T15:49:29Z) - Decision Forest Based EMG Signal Classification with Low Volume Dataset
Augmented with Random Variance Gaussian Noise [51.76329821186873]
We produce a model that can classify six different hand gestures with a limited number of samples that generalizes well to a wider audience.
We appeal to a set of more elementary methods, such as the use of random bounds on a signal, and aim to show the power these methods can carry in an online setting.
arXiv Detail & Related papers (2022-06-29T23:22:18Z) - Self-Supervised Pre-Training for Transformer-Based Person
Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z) - The Devil Is in the Details: An Efficient Convolutional Neural Network
for Transport Mode Detection [3.008051369744002]
Transport mode detection is a classification problem aiming to design an algorithm that can infer the transport mode of a user given multimodal signals.
We show that a small, optimized model can perform as well as a current deep model.
arXiv Detail & Related papers (2021-09-16T08:05:47Z) - ALT-MAS: A Data-Efficient Framework for Active Testing of Machine
Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z) - DataLoc+: A Data Augmentation Technique for Machine Learning in
Room-Level Indoor Localization [0.6961253535504979]
We propose DataLoc+, a data augmentation technique for room-level indoor localization.
We evaluate the technique by comparing it to the typical direct snapshot approach using data collected from a field experiment conducted in a hospital.
arXiv Detail & Related papers (2021-01-21T17:41:41Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the resulting dataset can significantly improve the ability of the learned FER model.
To reduce its size, we propose applying a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.