Generalized but not Robust? Comparing the Effects of Data Modification
Methods on Out-of-Domain Generalization and Adversarial Robustness
- URL: http://arxiv.org/abs/2203.07653v1
- Date: Tue, 15 Mar 2022 05:32:44 GMT
- Title: Generalized but not Robust? Comparing the Effects of Data Modification
Methods on Out-of-Domain Generalization and Adversarial Robustness
- Authors: Tejas Gokhale, Swaroop Mishra, Man Luo, Bhavdeep Singh Sachdeva and
Chitta Baral
- Abstract summary: We study common data modification strategies and evaluate not only their in-domain and out-of-domain (OOD) performance but also their adversarial robustness (AR).
Our findings suggest that more data (either via additional datasets or data augmentation) benefits both OOD accuracy and AR.
However, data filtering, previously shown to improve OOD accuracy on natural language inference, hurts OOD accuracy on other tasks such as question answering and image classification.
- Score: 27.868217989276797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data modification, whether via additional training datasets, data
augmentation, debiasing, or dataset filtering, has been proposed as an
effective solution for generalizing to out-of-domain (OOD) inputs, in both
natural language processing and computer vision literature. However, the effect
of data modification on adversarial robustness remains unclear. In this work,
we conduct a comprehensive study of common data modification strategies and
evaluate not only their in-domain and OOD performance, but also their
adversarial robustness (AR). We also present results on a two-dimensional
synthetic dataset to visualize the effect of each method on the training
distribution. This work serves as an empirical study towards understanding the
relationship between generalizing to unseen domains and defending against
adversarial perturbations. Our findings suggest that more data (either via
additional datasets or data augmentation) benefits both OOD accuracy and AR.
However, data filtering (previously shown to improve OOD accuracy on natural
language inference) hurts OOD accuracy on other tasks such as question
answering and image classification. We provide insights from our experiments to
inform future work in this direction.
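As a rough illustration of the evaluation protocol described in the abstract, the sketch below trains a linear classifier on a two-dimensional synthetic dataset under three data regimes (no modification, data augmentation, and dataset filtering) and reports in-domain, OOD, and adversarial accuracy. This is a minimal sketch, not the authors' code: the Gaussian blobs, the jitter-based augmentation, the margin-based filtering heuristic, and the FGSM-style attack are all assumptions chosen to keep the example self-contained.

```python
# Minimal sketch (not the paper's code): compare data-modification strategies
# on a 2-D synthetic task, reporting in-domain (ID), out-of-domain (OOD),
# and adversarial-robustness (AR) accuracy for a linear classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_blobs(n, shift=0.0):
    """Two Gaussian classes; `shift` moves the test distribution to create OOD data."""
    x0 = rng.normal([-1.0 + shift, 0.0], 0.6, size=(n, 2))
    x1 = rng.normal([+1.0 + shift, 0.0], 0.6, size=(n, 2))
    X = np.vstack([x0, x1])
    y = np.array([0] * n + [1] * n)
    return X, y

def augment(X, y, noise=0.3):
    """Data augmentation: add jittered copies of every training point."""
    return np.vstack([X, X + rng.normal(0, noise, X.shape)]), np.concatenate([y, y])

def filter_easy(X, y, keep=0.5):
    """Toy stand-in for dataset filtering: keep only points far from the class boundary."""
    margin = np.abs(X[:, 0])
    idx = np.argsort(-margin)[: int(keep * len(y))]
    return X[idx], y[idx]

def fgsm(clf, X, y, eps=0.2):
    """FGSM-style attack for logistic regression: step along the sign of the input gradient."""
    p = clf.predict_proba(X)[:, 1]
    grad = (p - y)[:, None] * clf.coef_  # dLoss/dx for the linear model
    return X + eps * np.sign(grad)

def evaluate(name, X_tr, y_tr):
    clf = LogisticRegression().fit(X_tr, y_tr)
    X_id, y_id = make_blobs(500)                # in-domain test set
    X_ood, y_ood = make_blobs(500, shift=0.8)   # shifted (OOD) test set
    X_adv = fgsm(clf, X_id, y_id)               # adversarial version of the ID set
    print(f"{name:10s} ID={clf.score(X_id, y_id):.2f} "
          f"OOD={clf.score(X_ood, y_ood):.2f} AR={clf.score(X_adv, y_id):.2f}")

X, y = make_blobs(200)
evaluate("baseline", X, y)
evaluate("augmented", *augment(X, y))
evaluate("filtered", *filter_easy(X, y))
```

Even on a toy task like this, shifting the test distribution and perturbing individual inputs probe different failure modes, which is exactly the distinction between OOD generalization and adversarial robustness that the paper studies.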
Related papers
- RICASSO: Reinforced Imbalance Learning with Class-Aware Self-Supervised Outliers Exposure [21.809270017579806]
Deep learning models often face challenges from both imbalanced (long-tailed) and out-of-distribution (OOD) data.
Our research shows that data mixing can generate pseudo-OOD data that exhibit the features of both in-distribution (ID) data and OOD data.
We propose a unified framework called Reinforced Imbalance Learning with Class-Aware Self-Supervised Outliers Exposure (RICASSO)
arXiv Detail & Related papers (2024-10-14T14:29:32Z)
- How Data Inter-connectivity Shapes LLMs Unlearning: A Structural Unlearning Perspective [29.924482732745954]
Existing approaches assume data points to-be-forgotten are independent, ignoring their inter-connectivity.
We propose PISTOL, a method for compiling structural datasets.
arXiv Detail & Related papers (2024-06-24T17:22:36Z)
- PairCFR: Enhancing Model Training on Paired Counterfactually Augmented Data through Contrastive Learning [49.60634126342945]
Counterfactually Augmented Data (CAD) involves creating new data samples by applying minimal yet sufficient modifications to flip the label of existing data samples to other classes.
Recent research reveals that training with CAD may lead models to overly focus on modified features while ignoring other important contextual information.
We employ contrastive learning to promote global feature alignment in addition to learning counterfactual clues.
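For intuition, here is a rough sketch (an assumption, not the PairCFR implementation) of how a cross-entropy term over original/counterfactual pairs can be combined with a supervised contrastive term so the encoder aligns features globally rather than latching onto the edited span; the encoder, classifier, temperature tau, and weight lam are placeholders.

```python
# Rough sketch (assumed, not PairCFR's released code): cross-entropy on
# counterfactual pairs plus a supervised contrastive term that pulls together
# same-label representations so the model keeps global context, not only the edit.
import torch
import torch.nn.functional as F

def paired_cfr_loss(encoder, classifier, x_orig, x_cf, y_orig, y_cf, tau=0.1, lam=0.5):
    """x_cf is the counterfactual edit of x_orig whose label has flipped to y_cf."""
    z = encoder(torch.cat([x_orig, x_cf]))      # shared representation for both views
    y = torch.cat([y_orig, y_cf])
    ce = F.cross_entropy(classifier(z), y)      # task loss on originals and counterfactuals

    # Supervised contrastive term: examples sharing a label are positives.
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau
    eye = torch.eye(len(y), device=z.device)
    pos = (y[:, None] == y[None, :]).float() - eye   # positives, excluding self
    logits = sim - 1e9 * eye                          # drop self-similarity from the softmax
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    contrastive = -(pos * log_prob).sum(1) / pos.sum(1).clamp(min=1)
    return ce + lam * contrastive.mean()
```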
arXiv Detail & Related papers (2024-06-09T07:29:55Z)
- Clarifying Myths About the Relationship Between Shape Bias, Accuracy, and Robustness [18.55761892159021]
Deep learning models can perform well when evaluated on images from the same distribution as the training set.
Applying small blurs to a model's input image or feeding the model out-of-distribution (OOD) data can significantly reduce the model's accuracy.
Data augmentation is one of the well-practiced methods to improve model robustness against OOD data.
arXiv Detail & Related papers (2024-06-07T15:21:00Z)
- Mixture Data for Training Cannot Ensure Out-of-distribution Generalization [21.801115344132114]
We show that increasing the size of training data does not always lead to a reduction in the test generalization error.
In this work, we quantitatively redefine OOD data as those situated outside the convex hull of mixed training data.
Our proof of the new risk bound shows that the efficacy of well-trained models can only be guaranteed for unseen data within this convex hull.
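Read literally, the convex-hull redefinition above can be written as follows (notation chosen here for illustration, not taken from the paper):

```latex
% Convex hull spanned by the (possibly mixed) training set D_train:
\mathrm{Conv}(\mathcal{D}_{\mathrm{train}}) =
  \Big\{ \textstyle\sum_i \alpha_i x_i \;\Big|\;
  x_i \in \mathcal{D}_{\mathrm{train}},\ \alpha_i \ge 0,\ \textstyle\sum_i \alpha_i = 1 \Big\}
% A test point x is OOD under this definition iff it lies outside that hull:
x \ \text{is OOD} \iff x \notin \mathrm{Conv}(\mathcal{D}_{\mathrm{train}})
```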
arXiv Detail & Related papers (2023-12-25T11:00:38Z)
- Reliability in Semantic Segmentation: Can We Use Synthetic Data? [69.28268603137546]
We show for the first time how synthetic data can be specifically generated to assess comprehensively the real-world reliability of semantic segmentation models.
This synthetic data is employed to evaluate the robustness of pretrained segmenters.
We demonstrate how our approach can be utilized to enhance the calibration and OOD detection capabilities of segmenters.
arXiv Detail & Related papers (2023-12-14T18:56:07Z)
- A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity [84.6421260559093]
This study is the largest set of experiments to validate, quantify, and expose undocumented intuitions about text pretraining.
Our findings indicate there does not exist a one-size-fits-all solution to filtering training data.
arXiv Detail & Related papers (2023-05-22T15:57:53Z)
- Out-of-distribution Detection with Implicit Outlier Transformation [72.73711947366377]
Outlier exposure (OE) is powerful in out-of-distribution (OOD) detection.
We propose a novel OE-based approach that makes the model perform well for unseen OOD situations.
arXiv Detail & Related papers (2023-03-09T04:36:38Z)
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- Harnessing Out-Of-Distribution Examples via Augmenting Content and Style [93.21258201360484]
Machine learning models are vulnerable to Out-Of-Distribution (OOD) examples.
This paper proposes a HOOD method that can leverage the content and style from each image instance to identify benign and malign OOD data.
Thanks to the proposed novel disentanglement and data augmentation techniques, HOOD can effectively deal with OOD examples in unknown and open environments.
arXiv Detail & Related papers (2022-07-07T08:48:59Z)
- Learning Infomax and Domain-Independent Representations for Causal Effect Inference with Real-World Data [9.601837205635686]
We learn Infomax and Domain-Independent Representations to address the challenges of causal effect inference with real-world data.
We show that our method achieves state-of-the-art performance on causal effect inference.
arXiv Detail & Related papers (2022-02-22T13:35:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.