Large-Scale Dataset Pruning in Adversarial Training through Data Importance Extrapolation
- URL: http://arxiv.org/abs/2406.13283v2
- Date: Thu, 11 Jul 2024 17:10:24 GMT
- Title: Large-Scale Dataset Pruning in Adversarial Training through Data Importance Extrapolation
- Authors: Björn Nieth, Thomas Altstidl, Leo Schwinn, Björn Eskofier
- Abstract summary: We propose a new data pruning strategy based on extrapolating data importance scores from a small set of data to a larger set.
In an empirical evaluation, we demonstrate that extrapolation-based pruning can efficiently reduce dataset size while maintaining robustness.
- Score: 1.3124513975412255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The vulnerability of deep learning models to small, imperceptible attacks limits their adoption in real-world systems. Adversarial training has proven to be one of the most promising strategies against these attacks, at the expense of a substantial increase in training time. With the ongoing trend of integrating large-scale synthetic data, this training time is only expected to grow further. This creates a need for data-centric approaches that reduce the number of training samples while maintaining accuracy and robustness. While data pruning and active learning are prominent research topics in deep learning, they remain largely unexplored in the adversarial training literature. We address this gap and propose a new data pruning strategy based on extrapolating data importance scores from a small set of data to a larger set. In an empirical evaluation, we demonstrate that extrapolation-based pruning can efficiently reduce dataset size while maintaining robustness.
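A minimal sketch of the extrapolation idea follows, assuming importance scores are already available for a small scored subset and that a k-nearest-neighbor regressor over feature embeddings stands in for the extrapolation model; the regressor choice, feature source, and keep fraction are illustrative assumptions, not the authors' exact method.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def extrapolate_importance(feats_small, scores_small, feats_full, k=5):
    """Fit a regressor on importance scores of a small scored subset and
    predict scores for every sample in the full dataset."""
    reg = KNeighborsRegressor(n_neighbors=k).fit(feats_small, scores_small)
    return reg.predict(feats_full)

def prune(feats_full, scores_full, keep_fraction=0.7):
    """Keep only the highest-scoring fraction of the dataset."""
    n_keep = int(keep_fraction * len(scores_full))
    keep_idx = np.argsort(scores_full)[-n_keep:]
    return feats_full[keep_idx], keep_idx

# Toy usage: scores from 1k samples extrapolated to 50k samples.
rng = np.random.default_rng(0)
feats_small = rng.normal(size=(1_000, 128))
feats_full = rng.normal(size=(50_000, 128))
scores_small = rng.uniform(size=1_000)
scores_full = extrapolate_importance(feats_small, scores_small, feats_full)
pruned, kept = prune(feats_full, scores_full)
print(pruned.shape)  # (35000, 128)
```

Keeping only the top-scoring fraction trades dataset size against robustness; the paper evaluates this trade-off empirically in the adversarial training setting.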
Related papers
- Granularity Matters in Long-Tail Learning [62.30734737735273]
We offer a novel perspective on long-tail learning, inspired by an observation: datasets with finer granularity tend to be less affected by data imbalance.
We introduce open-set auxiliary classes that are visually similar to existing ones, aiming to enhance representation learning for both head and tail classes.
To prevent the overwhelming presence of auxiliary classes from disrupting training, we introduce a neighbor-silencing loss.
arXiv Detail & Related papers (2024-10-21T13:06:21Z)
- Sexism Detection on a Data Diet [14.899608305188002]
We show how we can leverage influence scores to estimate the importance of a data point while training a model.
We evaluate the model performance trained on data pruned with different pruning strategies on three out-of-domain datasets.
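For illustration, a hedged sketch of one way per-sample importance can be scored during training; an EL2N-style error-norm score stands in here for the influence scores the paper uses, and is an assumption rather than the paper's exact definition.

```python
import torch
import torch.nn.functional as F

def el2n_scores(model, loader, device="cpu"):
    """EL2N-style importance: L2 norm of (softmax - one-hot) per sample.
    A stand-in for the influence scores discussed in the paper."""
    model.eval()
    scores = []
    with torch.no_grad():
        for x, y in loader:
            probs = F.softmax(model(x.to(device)), dim=1)
            onehot = F.one_hot(y.to(device), probs.shape[1]).float()
            scores.append((probs - onehot).norm(dim=1).cpu())
    return torch.cat(scores)

# Pruning strategy: drop the lowest-scoring (easiest) 30% of points.
# scores = el2n_scores(model, train_loader)
# keep = scores.argsort(descending=True)[: int(0.7 * len(scores))]
```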
arXiv Detail & Related papers (2024-06-07T12:39:54Z)
- An In-Depth Analysis of Data Reduction Methods for Sustainable Deep Learning [0.15833270109954137]
We present up to eight different methods to reduce the size of a training dataset.
We also develop a Python package to apply them.
We experimentally compare how these data reduction methods affect the representativeness of the reduced dataset.
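As a minimal sketch of the kind of reduction strategies such packages cover, the following compares plain random sampling with stratified per-class sampling; the specific methods and API here are assumptions, not the paper's package.

```python
import numpy as np

def random_reduction(n_samples, fraction, rng):
    """Uniformly sample a fraction of the dataset indices."""
    return rng.choice(n_samples, size=int(fraction * n_samples), replace=False)

def stratified_reduction(labels, fraction, rng):
    """Sample the same fraction per class to preserve the label balance."""
    keep = []
    for c in np.unique(labels):
        cls_idx = np.flatnonzero(labels == c)
        keep.append(rng.choice(cls_idx, size=int(fraction * len(cls_idx)), replace=False))
    return np.concatenate(keep)

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=10_000)
subset = stratified_reduction(labels, 0.25, rng)
print(np.bincount(labels[subset]))  # per-class counts stay roughly balanced
```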
arXiv Detail & Related papers (2024-03-22T12:06:40Z)
- Small Dataset, Big Gains: Enhancing Reinforcement Learning by Offline Pre-Training with Model Based Augmentation [59.899714450049494]
Offline pre-training can produce sub-optimal policies and lead to degraded online reinforcement learning performance.
We propose a model-based data augmentation strategy to maximize the benefits of offline reinforcement learning pre-training and reduce the scale of data needed to be effective.
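A rough sketch of the general pattern (MBPO-style rollouts of a learned dynamics model from offline states); the dynamics model, policy, and rollout length below are toy stand-ins, not the paper's actual strategy.

```python
import numpy as np

def augment_offline_data(buffer, dynamics_model, policy, rollout_len=5, batch=64, rng=None):
    """Generate synthetic transitions by rolling out a learned dynamics
    model from states sampled out of the offline buffer."""
    if rng is None:
        rng = np.random.default_rng()
    states = buffer["states"][rng.choice(len(buffer["states"]), size=batch, replace=False)]
    synthetic = []
    for _ in range(rollout_len):
        actions = policy(states)
        next_states, rewards = dynamics_model(states, actions)
        synthetic.append((states, actions, rewards, next_states))
        states = next_states
    return synthetic

# Toy stand-ins (assumptions): linear dynamics, Gaussian random policy.
rng = np.random.default_rng(0)
dyn = lambda s, a: (s + 0.1 * a, -np.linalg.norm(s, axis=1))
pol = lambda s: rng.normal(size=s.shape)
buffer = {"states": rng.normal(size=(1_000, 4))}
print(len(augment_offline_data(buffer, dyn, pol, rng=rng)))  # 5 rollout steps
```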
arXiv Detail & Related papers (2023-12-15T14:49:41Z)
- To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We further examine the key factors contributing to multi-epoch degradation, finding that dataset size, model parameters, and training objectives all play a significant role.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
- Re-thinking Data Availablity Attacks Against Deep Neural Networks [53.64624167867274]
In this paper, we re-examine the concept of unlearnable examples and discern that the existing robust error-minimizing noise presents an inaccurate optimization objective.
We introduce a novel optimization paradigm that yields improved protection results with reduced computational time requirements.
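For context, a hedged sketch of the classic min-min error-minimizing noise baseline that this line of work builds on; this illustrates the objective being re-examined, not the paper's new optimization paradigm.

```python
import torch
import torch.nn.functional as F

def error_minimizing_noise(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Min-min 'unlearnable examples' baseline: perturb inputs to MINIMIZE
    the training loss, so that models trained on them learn little useful
    signal from the protected data."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= alpha * grad.sign()   # gradient DESCENT on the loss
            delta.clamp_(-eps, eps)        # keep the noise imperceptible
    return (x + delta).detach()
```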
arXiv Detail & Related papers (2023-05-18T04:03:51Z)
- Reconstructing Training Data from Model Gradient, Provably [68.21082086264555]
We reconstruct the training samples from a single gradient query at a randomly chosen parameter value.
As a provable attack that reveals sensitive training data, our findings suggest potential severe threats to privacy.
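A minimal gradient-matching sketch in the spirit of gradient-inversion attacks; the Adam optimizer, step count, and the assumption that the label is known are illustrative choices, not the paper's provable procedure.

```python
import torch
import torch.nn.functional as F

def reconstruct_from_gradient(model, target_grads, x_shape, y, steps=200, lr=0.1):
    """Optimize a dummy input so that its loss gradient matches the
    gradient observed from a single query."""
    x_hat = torch.randn(x_shape, requires_grad=True)
    opt = torch.optim.Adam([x_hat], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(x_hat), y)
        grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
        # Squared distance between observed and reconstructed gradients.
        match = sum(((g - t) ** 2).sum() for g, t in zip(grads, target_grads))
        match.backward()
        opt.step()
    return x_hat.detach()
```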
arXiv Detail & Related papers (2022-12-07T15:32:22Z)
- Membership Inference Attacks via Adversarial Examples [5.721380617450644]
Membership inference attacks are a novel research direction that aims to recover the training data used by a learning algorithm.
We develop a means to measure the leakage of training data, leveraging a quantity that acts as a proxy for the total variation of a trained model.
arXiv Detail & Related papers (2022-07-27T15:10:57Z)
- A Deep-Learning Intelligent System Incorporating Data Augmentation for Short-Term Voltage Stability Assessment of Power Systems [9.299576471941753]
This paper proposes a novel deep-learning intelligent system incorporating data augmentation for short-term voltage stability assessment (STVSA) of power systems.
Because reliable quantitative criteria to judge the stability status of a specific power system are unavailable, semi-supervised cluster learning is leveraged to obtain labeled samples.
Conditional least-squares generative adversarial network (LSGAN)-based data augmentation is introduced to expand the original dataset.
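For reference, the standard least-squares GAN objectives that LSGAN-based augmentation builds on (a minimal sketch; the conditional inputs and network architectures are omitted).

```python
import torch

def lsgan_losses(d_real, d_fake):
    """Least-squares GAN objectives: the discriminator regresses real
    outputs toward 1 and fake outputs toward 0; the generator pushes
    fake outputs toward 1."""
    d_loss = 0.5 * ((d_real - 1) ** 2).mean() + 0.5 * (d_fake ** 2).mean()
    g_loss = 0.5 * ((d_fake - 1) ** 2).mean()
    return d_loss, g_loss

# Toy usage with raw (unbounded) discriminator outputs.
d_loss, g_loss = lsgan_losses(torch.randn(8), torch.randn(8))
print(float(d_loss), float(g_loss))
```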
arXiv Detail & Related papers (2021-12-05T11:40:54Z)
- The Imaginative Generative Adversarial Network: Automatic Data Augmentation for Dynamic Skeleton-Based Hand Gesture and Human Action Recognition [27.795763107984286]
We present a novel automatic data augmentation model, which approximates the distribution of the input data and samples new data from this distribution.
Our results show that the augmentation strategy is fast to train and can improve classification accuracy for both neural networks and state-of-the-art methods.
arXiv Detail & Related papers (2021-05-27T11:07:09Z)
- Provably Efficient Causal Reinforcement Learning with Confounded Observational Data [135.64775986546505]
We study how to incorporate the dataset (observational data) collected offline, which is often abundantly available in practice, to improve the sample efficiency in the online setting.
We propose the deconfounded optimistic value iteration (DOVI) algorithm, which incorporates the confounded observational data in a provably efficient manner.
arXiv Detail & Related papers (2020-06-22T14:49:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.