Failures Are Fated, But Can Be Faded: Characterizing and Mitigating Unwanted Behaviors in Large-Scale Vision and Language Models
- URL: http://arxiv.org/abs/2406.07145v2
- Date: Thu, 13 Jun 2024 03:58:32 GMT
- Title: Failures Are Fated, But Can Be Faded: Characterizing and Mitigating Unwanted Behaviors in Large-Scale Vision and Language Models
- Authors: Som Sagar, Aditya Taparia, Ransalu Senanayake
- Abstract summary: In large deep neural networks that seem to perform surprisingly well on many tasks, we also observe a few failures related to accuracy, social biases, and alignment with human values.
We introduce a post-hoc method that utilizes deep reinforcement learning to explore and construct the landscape of failure modes in pre-trained discriminative and generative models.
We empirically show the effectiveness of the proposed method across common Computer Vision, Natural Language Processing, and Vision-Language tasks.
- Score: 7.736445799116692
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In large deep neural networks that seem to perform surprisingly well on many tasks, we also observe a few failures related to accuracy, social biases, and alignment with human values, among others. Therefore, before deploying these models, it is crucial to characterize this failure landscape for engineers to debug and legislative bodies to audit models. Nevertheless, it is infeasible to exhaustively test for all possible combinations of factors that could lead to a model's failure. In this paper, we introduce a post-hoc method that utilizes \emph{deep reinforcement learning} to explore and construct the landscape of failure modes in pre-trained discriminative and generative models. With the aid of limited human feedback, we then demonstrate how to restructure the failure landscape to be more desirable by moving away from the discovered failure modes. We empirically show the effectiveness of the proposed method across common Computer Vision, Natural Language Processing, and Vision-Language tasks.
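The abstract describes searching a combinatorial space of input factors with deep reinforcement learning to map out where a pre-trained model fails. A minimal, hypothetical sketch of that idea is below: a bandit-style agent (a toy stand-in for the paper's deep-RL policy) learns which discrete perturbation factor most reliably triggers a failure of a stub model. The names `model_fails` and `discover_failure_mode` are illustrative assumptions, not the paper's actual interfaces.

```python
import random

# Hypothetical stub "model": it fails (e.g., misclassifies) whenever the
# input is rendered with factor index 2 (say, a specific corruption type).
def model_fails(factor: int) -> bool:
    return factor == 2

# Bandit-style search over discrete perturbation factors: the agent keeps
# an incremental estimate of each factor's failure rate and exploits the
# current best while occasionally exploring (epsilon-greedy).
def discover_failure_mode(n_factors=5, steps=2000, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = [0.0] * n_factors        # estimated failure rate per factor
    counts = [0] * n_factors
    for _ in range(steps):
        if rng.random() < eps:   # explore a random factor
            a = rng.randrange(n_factors)
        else:                    # exploit the current best estimate
            a = max(range(n_factors), key=lambda i: q[i])
        r = 1.0 if model_fails(a) else 0.0   # reward = observed failure
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]       # incremental mean update
    return max(range(n_factors), key=lambda i: q[i])

print(discover_failure_mode())
```

In the paper's setting the discrete arm would be replaced by a deep-RL policy over a much richer factor space, and the discovered modes would then be used, with limited human feedback, to fine-tune the model away from them.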
Related papers
- LLM-Assisted Red Teaming of Diffusion Models through "Failures Are Fated, But Can Be Faded" [7.736445799116692]
"Failures are fated, but can be faded" is a framework to explore and construct the failure landscape in pre-trained generative models.
We show how to restructure the failure landscape to be more desirable by moving away from the discovered failure modes.
arXiv Detail & Related papers (2024-10-22T06:46:09Z)
- Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z)
- ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models [55.07988373824348]
We study the visual generalization capabilities of three existing robotic foundation models.
Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios.
We propose a gradual backbone reversal approach founded on model merging.
arXiv Detail & Related papers (2024-09-23T17:47:59Z)
- What could go wrong? Discovering and describing failure modes in computer vision [27.6114923305978]
We formalize the problem of Language-Based Error Explainability (LBEE).
We propose solutions that operate in a joint vision-and-language embedding space.
We show that the proposed methodology isolates nontrivial sentences associated with specific error causes.
arXiv Detail & Related papers (2024-08-08T14:01:12Z)
- Partially Recentralization Softmax Loss for Vision-Language Models Robustness [8.78222772167501]
We study the adversarial robustness gained by modifying the loss function of pre-trained multimodal models.
Our experiments show that after fine-tuning, the adversarial robustness of pre-trained models against popular attacks can be significantly improved.
arXiv Detail & Related papers (2024-02-06T01:44:38Z)
- Identifying and Mitigating Model Failures through Few-shot CLIP-aided Diffusion Generation [65.268245109828]
We propose an end-to-end framework to generate text descriptions of failure modes associated with spurious correlations.
These descriptions can be used to generate synthetic data using generative models, such as diffusion models.
Our experiments have shown remarkable improvements in accuracy (~21%) on hard sub-populations.
arXiv Detail & Related papers (2023-12-09T04:43:49Z)
- Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy which aims to rectify the noisy predictions from vision models.
By fine-tuning with the denoised labels, model performance can be boosted in an unsupervised manner.
arXiv Detail & Related papers (2023-12-05T07:29:14Z)
- A Survey on Transferability of Adversarial Examples across Deep Neural Networks [53.04734042366312]
Adversarial examples can manipulate machine learning models into making erroneous predictions.
The transferability of adversarial examples enables black-box attacks which circumvent the need for detailed knowledge of the target model.
This survey explores the landscape of adversarial-example transferability.
arXiv Detail & Related papers (2023-10-26T17:45:26Z)
- Human-Understandable Decision Making for Visual Recognition [30.30163407674527]
We propose a new framework to train a deep neural network by incorporating the prior of human perception into the model learning process.
The effectiveness of our proposed model is evaluated on two classical visual recognition tasks.
arXiv Detail & Related papers (2021-03-05T02:07:33Z)
- Plausible Counterfactuals: Auditing Deep Learning Classifiers with Realistic Adversarial Examples [84.8370546614042]
The black-box nature of deep learning models has posed unanswered questions about what they learn from data.
A Generative Adversarial Network (GAN) and multi-objective optimization are used to furnish a plausible attack on the audited model.
Its utility is showcased within a human face classification task, unveiling the enormous potential of the proposed framework.
arXiv Detail & Related papers (2020-03-25T11:08:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.