Understanding and Enhancing Robustness of Concept-based Models
- URL: http://arxiv.org/abs/2211.16080v1
- Date: Tue, 29 Nov 2022 10:43:51 GMT
- Title: Understanding and Enhancing Robustness of Concept-based Models
- Authors: Sanchit Sinha, Mengdi Huai, Jianhui Sun, Aidong Zhang
- Abstract summary: We study the robustness of concept-based models to adversarial perturbations.
In this paper, we first propose and analyze different malicious attacks to evaluate the security vulnerability of concept-based models.
We then propose a potential general adversarial training-based defense mechanism to increase the robustness of these systems to the proposed malicious attacks.
- Score: 41.20004311158688
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rising use of deep neural networks for decision making in critical
applications like medical diagnosis and financial analysis has raised concerns
regarding their reliability and trustworthiness. As automated systems become
more mainstream, it is important that their decisions be transparent, reliable, and
understandable by humans to foster trust and confidence. To this end,
concept-based models such as Concept Bottleneck Models (CBMs) and
Self-Explaining Neural Networks (SENN) have been proposed, which constrain the
latent space of a model to represent high-level concepts easily understood by
domain experts in the field. Although concept-based models are a promising
approach to increasing both explainability and reliability, it has yet to be
shown whether they are robust and output consistent concepts under
systematic perturbations to their inputs. To better understand the behavior of
concept-based models on curated malicious samples, in this paper we study
their robustness to adversarial perturbations, i.e., imperceptible changes to
the input data that are crafted by an attacker to fool a well-learned
concept-based model. Specifically, we first propose and analyze different
malicious attacks to evaluate the security vulnerability of concept-based
models. Subsequently, we propose a potential general adversarial
training-based defense mechanism to increase the robustness of these systems to the
proposed malicious attacks. Extensive experiments on one synthetic and two
real-world datasets demonstrate the effectiveness of the proposed attacks and
the defense approach.
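The abstract does not include implementation details, but the attack-and-defense recipe it outlines can be illustrated concretely. Below is a minimal sketch, assuming a PyTorch concept bottleneck model whose forward pass returns (concept logits, task logits), a PGD-style L-infinity attack on the concept predictions, and inputs normalized to [0, 1]; the function names, the particular attack, and the loss weighting `lam` are illustrative assumptions, not the authors' actual method.

```python
# Hypothetical sketch (not the paper's code): a PGD-style attack that perturbs
# inputs to corrupt the concept predictions of a concept bottleneck model, and
# one adversarial-training step that retrains on such perturbed samples.
import torch
import torch.nn.functional as F


def pgd_concept_attack(model, x, concepts, eps=8 / 255, alpha=2 / 255, steps=10):
    """Craft an L-infinity perturbation that maximizes the concept-prediction loss.

    Assumes model(x) -> (concept_logits, task_logits) and inputs in [0, 1].
    """
    x = x.detach()
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        concept_logits, _ = model(x_adv)
        loss = F.binary_cross_entropy_with_logits(concept_logits, concepts)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()          # ascend the concept loss
        x_adv = x + (x_adv - x).clamp(-eps, eps)              # project back into the eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()


def adversarial_training_step(model, optimizer, x, concepts, labels, lam=1.0):
    """One adversarial-training step: fit the model on attack-perturbed inputs."""
    x_adv = pgd_concept_attack(model, x, concepts)
    concept_logits, task_logits = model(x_adv)
    loss = F.binary_cross_entropy_with_logits(concept_logits, concepts) \
        + lam * F.cross_entropy(task_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the attack maximizes the concept-prediction loss rather than the task loss, reflecting the paper's focus on whether small perturbations make a model output inconsistent concepts; the defense simply retrains on such perturbed samples with a weighted sum of concept and task losses.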
Related papers
- A Framework for Strategic Discovery of Credible Neural Network Surrogate Models under Uncertainty [0.0]
This study presents the Occam Plausibility Algorithm for surrogate models (OPAL-surrogate)
OPAL-surrogate provides a systematic framework to uncover predictive neural network-based surrogate models.
It balances the trade-off between model complexity, accuracy, and prediction uncertainty.
arXiv Detail & Related papers (2024-03-13T18:45:51Z) - NeuralSentinel: Safeguarding Neural Network Reliability and
Trustworthiness [0.0]
We present NeuralSentinel (NS), a tool able to validate the reliability and trustworthiness of AI models.
NS helps non-expert staff increase their confidence in this new system by helping them understand the model's decisions.
This tool was deployed and used in a Hackathon event to evaluate the reliability of a skin cancer image detector.
arXiv Detail & Related papers (2024-02-12T09:24:34Z) - Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models? [52.238883592674696]
Ring-A-Bell is a model-agnostic red-teaming tool for T2I diffusion models.
It identifies problematic prompts for diffusion models with the corresponding generation of inappropriate content.
Our results show that Ring-A-Bell, by manipulating safe prompting benchmarks, can transform prompts that were originally regarded as safe to evade existing safety mechanisms.
arXiv Detail & Related papers (2023-10-16T02:11:20Z) - Boosting Adversarial Robustness using Feature Level Stochastic Smoothing [46.86097477465267]
Adversarial defenses have led to a significant improvement in the robustness of Deep Neural Networks.
In this work, we propose a generic method for introducing stochasticity in the network predictions.
We also utilize this smoothing to reject low-confidence predictions.
arXiv Detail & Related papers (2023-06-10T15:11:24Z) - Concept Embedding Models [27.968589555078328]
Concept bottleneck models promote trustworthiness by conditioning classification tasks on an intermediate level of human-like concepts.
Existing concept bottleneck models are unable to find optimal compromises between high task accuracy, robust concept-based explanations, and effective interventions on concepts.
We propose Concept Embedding Models, a novel family of concept bottleneck models which goes beyond the current accuracy-vs-interpretability trade-off by learning interpretable high-dimensional concept representations.
arXiv Detail & Related papers (2022-09-19T14:49:36Z) - Exploring the Trade-off between Plausibility, Change Intensity and
Adversarial Power in Counterfactual Explanations using Multi-objective
Optimization [73.89239820192894]
We argue that automated counterfactual generation should take several aspects of the produced adversarial instances into account.
We present a novel framework for the generation of counterfactual examples.
arXiv Detail & Related papers (2022-05-20T15:02:53Z) - A Unified Contrastive Energy-based Model for Understanding the
Generative Ability of Adversarial Training [64.71254710803368]
Adversarial Training (AT) is an effective approach to enhance the robustness of deep neural networks.
We demystify this phenomenon by developing a unified probabilistic framework, called Contrastive Energy-based Models (CEM).
We propose a principled method to develop adversarial learning and sampling methods.
arXiv Detail & Related papers (2022-03-25T05:33:34Z) - A Comprehensive Evaluation Framework for Deep Model Robustness [44.20580847861682]
Deep neural networks (DNNs) have achieved remarkable performance across a wide range of applications.
However, they are vulnerable to adversarial examples, which motivates adversarial defense.
This paper presents a model evaluation framework containing a comprehensive, rigorous, and coherent set of evaluation metrics.
arXiv Detail & Related papers (2021-01-24T01:04:25Z) - Trust but Verify: Assigning Prediction Credibility by Counterfactual
Constrained Learning [123.3472310767721]
Prediction credibility measures are fundamental in statistics and machine learning.
These measures should account for the wide variety of models used in practice.
The framework developed in this work expresses the credibility as a risk-fit trade-off.
arXiv Detail & Related papers (2020-11-24T19:52:38Z) - A general framework for defining and optimizing robustness [74.67016173858497]
We propose a rigorous and flexible framework for defining different types of robustness properties for classifiers.
Our concept is based on the postulate that the robustness of a classifier should be considered a property independent of its accuracy.
We develop a very general robustness framework that is applicable to any type of classification model.
arXiv Detail & Related papers (2020-06-19T13:24:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.