Radioactive data: tracing through training
- URL: http://arxiv.org/abs/2002.00937v1
- Date: Mon, 3 Feb 2020 18:41:08 GMT
- Title: Radioactive data: tracing through training
- Authors: Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, Hervé Jégou
- Abstract summary: We propose a new technique, \emph{radioactive data}, that makes imperceptible changes to this dataset such that any model trained on it will bear an identifiable mark.
Given a trained model, our technique detects the use of radioactive data and provides a level of confidence (p-value).
Our method is robust to data augmentation and the stochasticity of deep network optimization.
- Score: 130.2266320167683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We want to detect whether a particular image dataset has been used to train a
model. We propose a new technique, \emph{radioactive data}, that makes
imperceptible changes to this dataset such that any model trained on it will
bear an identifiable mark. The mark is robust to strong variations such as
different architectures or optimization methods. Given a trained model, our
technique detects the use of radioactive data and provides a level of
confidence (p-value). Our experiments on large-scale benchmarks (Imagenet),
using standard architectures (Resnet-18, VGG-16, Densenet-121) and training
procedures, show that we can detect usage of radioactive data with high
confidence (p<10^-4) even when only 1% of the data used to train our model is
radioactive. Our method is robust to data augmentation and the stochasticity of
deep network optimization. As a result, it offers a much higher signal-to-noise
ratio than data poisoning and backdoor methods.
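The abstract describes detection as a hypothesis test that yields a p-value. As a minimal sketch (not the authors' implementation), assume detection reduces to measuring the alignment between a model's classifier weight vector `w` and a planted carrier direction `u`: under the null hypothesis (the model never saw radioactive data), `w` carries no information about `u`, so their cosine similarity should look like that of a random direction. A Monte Carlo version of that test:

```python
import numpy as np

def detection_p_value(w, u, n_null=10_000, seed=0):
    """Monte Carlo p-value for alignment between classifier weights w
    and a planted carrier direction u. Under the null (the model never
    trained on radioactive data), cos(w, u) should be indistinguishable
    from the cosine between u and a random direction."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w, dtype=float)
    u = np.asarray(u, dtype=float)
    cos = w @ u / (np.linalg.norm(w) * np.linalg.norm(u))
    # Null distribution: cosines between u and random Gaussian directions,
    # which are uniformly distributed on the sphere after normalization.
    r = rng.standard_normal((n_null, w.size))
    null_cos = r @ u / (np.linalg.norm(r, axis=1) * np.linalg.norm(u))
    # One-sided test: how often does chance alignment exceed the observed one?
    return (np.sum(null_cos >= cos) + 1) / (n_null + 1)
```

In high dimension the null cosine concentrates sharply around zero, which is why even a weak planted alignment can be detected with very small p-values; the paper reports p<10^-4 with only 1% radioactive data.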
Related papers
- Data-Independent Operator: A Training-Free Artifact Representation
Extractor for Generalizable Deepfake Detection [105.9932053078449]
In this work, we show that, on the contrary, the small and training-free filter is sufficient to capture more general artifact representations.
Because it is unbiased towards both the training and test sources, we define it as a Data-Independent Operator (DIO) to achieve appealing improvements on unseen sources.
Our detector achieves a remarkable improvement of 13.3%, establishing a new state-of-the-art performance.
arXiv Detail & Related papers (2024-03-11T15:22:28Z) - DiffusionEngine: Diffusion Model is Scalable Data Engine for Object
Detection [41.436817746749384]
Diffusion Model is a scalable data engine for object detection.
DiffusionEngine (DE) provides high-quality detection-oriented training pairs in a single stage.
arXiv Detail & Related papers (2023-09-07T17:55:01Z) - Exploring Data Redundancy in Real-world Image Classification through
Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
arXiv Detail & Related papers (2023-06-25T03:31:05Z) - Anomaly Detection with Ensemble of Encoder and Decoder [2.8199078343161266]
Anomaly detection in power grids aims to detect and discriminate anomalies caused by cyber attacks against the power system.
We propose a novel anomaly detection method by modeling the data distribution of normal samples via multiple encoders and decoders.
Experiment results on network intrusion and power system datasets demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-03-11T15:49:29Z) - Decision Forest Based EMG Signal Classification with Low Volume Dataset
Augmented with Random Variance Gaussian Noise [51.76329821186873]
We produce a model that can classify six different hand gestures with a limited number of samples that generalizes well to a wider audience.
We appeal to a set of more elementary methods, such as the use of random bounds on a signal, and aim to show the power these methods can carry in an online setting.
arXiv Detail & Related papers (2022-06-29T23:22:18Z) - Self-Supervised Pre-Training for Transformer-Based Person
Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z) - The Devil Is in the Details: An Efficient Convolutional Neural Network
for Transport Mode Detection [3.008051369744002]
Transport mode detection is a classification problem aiming to design an algorithm that can infer the transport mode of a user given multimodal signals.
We show that a small, optimized model can perform as well as a current deep model.
arXiv Detail & Related papers (2021-09-16T08:05:47Z) - ALT-MAS: A Data-Efficient Framework for Active Testing of Machine
Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z) - DataLoc+: A Data Augmentation Technique for Machine Learning in
Room-Level Indoor Localization [0.6961253535504979]
We propose DataLoc+, a data augmentation technique for room-level indoor localization.
We evaluate the technique by comparing it to the typical direct snapshot approach using data collected from a field experiment conducted in a hospital.
arXiv Detail & Related papers (2021-01-21T17:41:41Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the resulting dataset can significantly improve the ability of the learned FER model.
To reduce its size, we propose applying a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.