On the Safety of Interpretable Machine Learning: A Maximum Deviation
Approach
- URL: http://arxiv.org/abs/2211.01498v1
- Date: Wed, 2 Nov 2022 21:57:24 GMT
- Title: On the Safety of Interpretable Machine Learning: A Maximum Deviation
Approach
- Authors: Dennis Wei, Rahul Nair, Amit Dhurandhar, Kush R. Varshney, Elizabeth
M. Daly, Moninder Singh
- Abstract summary: Interpretable and explainable machine learning has seen a recent surge of interest.
We focus on safety as a key motivation behind the surge and make the relationship between interpretability and safety more quantitative.
We present case studies, including one on mortgage approval, to illustrate our methods and the insights about models that may be obtained from deviation maximization.
- Score: 42.31002956593477
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Interpretable and explainable machine learning has seen a recent surge of
interest. We focus on safety as a key motivation behind the surge and make the
relationship between interpretability and safety more quantitative. Toward
assessing safety, we introduce the concept of maximum deviation via an
optimization problem to find the largest deviation of a supervised learning
model from a reference model regarded as safe. We then show how
interpretability facilitates this safety assessment. For models including
decision trees, generalized linear and additive models, the maximum deviation
can be computed exactly and efficiently. For tree ensembles, which are not
regarded as interpretable, discrete optimization techniques can still provide
informative bounds. For a broader class of piecewise Lipschitz functions, we
leverage the multi-armed bandit literature to show that interpretability
produces tighter (regret) bounds on the maximum deviation. We present case
studies, including one on mortgage approval, to illustrate our methods and the
insights about models that may be obtained from deviation maximization.
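As a rough illustration of the maximum-deviation idea (not code from the paper), the sketch below treats the simplest interpretable case: the deviation metric is taken to be |f(x) - f0(x)|, the reference model f0 is a constant, and f is a fitted decision tree, so f is piecewise constant and the maximum of |f(x) - f0| over the input domain is attained at some leaf. The helper name max_deviation_tree_vs_constant, the constant reference value, and the use of scikit-learn's DecisionTreeRegressor are illustrative assumptions; the sketch also assumes every leaf region intersects the domain of interest.
```python
# Hypothetical sketch of exact maximum deviation for a decision tree vs. a
# constant reference model f0 (an assumption for illustration, not the paper's code).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def max_deviation_tree_vs_constant(tree: DecisionTreeRegressor, f0_value: float) -> float:
    """Exact max of |f(x) - f0_value| when f is a piecewise-constant tree."""
    t = tree.tree_
    is_leaf = t.children_left == -1          # leaf nodes have no children
    leaf_values = t.value[is_leaf].ravel()   # one constant prediction per leaf
    return float(np.max(np.abs(leaf_values - f0_value)))

# Toy usage: fit a small tree and compare it to a constant "safe" reference value.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=200)
f = DecisionTreeRegressor(max_depth=3).fit(X, y)
print("max deviation from f0 = 0.5:", max_deviation_tree_vs_constant(f, 0.5))
```
The same leaf-scan idea extends to a piecewise-constant reference model by enumerating pairs of overlapping leaf regions, at the cost of a larger but still finite search; for tree ensembles, which are not interpretable in this sense, the abstract above notes that only bounds via discrete optimization are reported.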
Related papers
- Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level [10.658844160259104]
Large language models (LLMs) have demonstrated immense utility across various industries.
As LLMs advance, the risk of harmful outputs increases due to incorrect or malicious instruction prompts.
This paper examines the LLMs' capability to recognize harmful outputs, revealing and quantifying their proficiency in assessing the danger of previous tokens.
arXiv Detail & Related papers (2024-10-09T12:09:30Z) - One-Shot Safety Alignment for Large Language Models via Optimal Dualization [64.52223677468861]
This paper presents a dualization perspective that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based scenarios.
arXiv Detail & Related papers (2024-05-29T22:12:52Z) - Towards Precise Observations of Neural Model Robustness in Classification [2.127049691404299]
In deep learning applications, robustness measures the ability of neural models to handle slight changes in input data.
Our approach contributes to a deeper understanding of model robustness in safety-critical applications.
arXiv Detail & Related papers (2024-04-25T09:37:44Z) - Nevermind: Instruction Override and Moderation in Large Language Models [2.0935496890864207]
We investigate and benchmark the most popular proprietary models and open-source models of various sizes on the task of explicit instruction following in conflicting situations.
We observe that improving instruction following, and subsequently instruction overrides/jailbreaks, is fundamentally at odds with the ability of a language model to follow given safety filters or guidelines.
arXiv Detail & Related papers (2024-02-05T18:58:19Z) - Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors for overkill by exploring how models handle and determine the safety of queries.
Our findings reveal the presence of shortcuts within models, which lead to excessive attention to harmful words like 'kill', and show that prompts emphasizing safety exacerbate overkill.
We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
arXiv Detail & Related papers (2024-01-31T07:26:47Z) - Safe MDP Planning by Learning Temporal Patterns of Undesirable
Trajectories and Averting Negative Side Effects [27.41101006357176]
In safe MDP planning, a cost function based on the current state and action is often used to specify safety aspects.
Operating based on an incomplete model can often produce unintended negative side effects (NSEs).
arXiv Detail & Related papers (2023-04-06T14:03:24Z) - When Demonstrations Meet Generative World Models: A Maximum Likelihood
Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z) - Log Barriers for Safe Black-box Optimization with Application to Safe
Reinforcement Learning [72.97229770329214]
We introduce a general approach for solving high-dimensional non-linear optimization problems in which maintaining safety during learning is crucial.
Our approach, called LBSGD, is based on applying a logarithmic barrier approximation with a carefully chosen step size.
We demonstrate the effectiveness of our approach on minimizing constraint violation in policy optimization tasks in safe reinforcement learning.
arXiv Detail & Related papers (2022-07-21T11:14:47Z) - Maximum Likelihood Constraint Inference from Stochastic Demonstrations [5.254702845143088]
This paper extends maximum likelihood constraint inference to stochastic applications by using maximum causal entropy likelihoods.
We propose an efficient algorithm that computes constraint likelihood and risk tolerance in a unified Bellman backup.
arXiv Detail & Related papers (2021-02-24T20:46:55Z) - SAMBA: Safe Model-Based & Active Reinforcement Learning [59.01424351231993]
SAMBA is a framework for safe reinforcement learning that combines aspects from probabilistic modelling, information theory, and statistics.
We evaluate our algorithm on a variety of safe dynamical system benchmarks involving both low and high-dimensional state representations.
We provide intuition as to the effectiveness of the framework by a detailed analysis of our active metrics and safety constraints.
arXiv Detail & Related papers (2020-06-12T10:40:46Z)