Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
- URL: http://arxiv.org/abs/2505.14633v1
- Date: Tue, 20 May 2025 17:24:09 GMT
- Title: Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
- Authors: Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger
- Abstract summary: Inspired by how risky behaviors in humans are sometimes guided by strongly-held values, we believe that identifying values within AI models can be an early warning system for AI's risky behaviors. We create LitmusValues, an evaluation pipeline to reveal AI models' priorities on a range of AI value classes. We show that values in LitmusValues can predict both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.
- Score: 34.90544849649325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detecting AI risks becomes more challenging as stronger models emerge and find novel methods such as Alignment Faking to circumvent these detection attempts. Inspired by how risky behaviors in humans (i.e., illegal activities that may hurt others) are sometimes guided by strongly-held values, we believe that identifying values within AI models can be an early warning system for AI's risky behaviors. We create LitmusValues, an evaluation pipeline to reveal AI models' priorities on a range of AI value classes. Then, we collect AIRiskDilemmas, a diverse collection of dilemmas that pit values against one another in scenarios relevant to AI safety risks such as Power Seeking. By measuring an AI model's value prioritization using its aggregate choices, we obtain a self-consistent set of predicted value priorities that uncover potential risks. We show that values in LitmusValues (including seemingly innocuous ones like Care) can predict both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.
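To make the aggregation step concrete, here is a minimal sketch of how pairwise outcomes from value dilemmas could be turned into a priority ranking via simple win rates. The value names, the `choices` records, and the win-rate scoring are illustrative assumptions, not the authors' released pipeline (which derives a self-consistent ranking from the model's aggregate choices).

```python
from collections import defaultdict

# Each record: (value_a, value_b, winner) -- the value that the model's
# choice in one dilemma effectively prioritized. Illustrative data only.
choices = [
    ("Care", "Honesty", "Care"),
    ("Care", "Power", "Care"),
    ("Honesty", "Power", "Honesty"),
    ("Care", "Honesty", "Care"),
]

wins = defaultdict(int)
appearances = defaultdict(int)
for a, b, winner in choices:
    appearances[a] += 1
    appearances[b] += 1
    wins[winner] += 1

# Rank each value by how often it wins the dilemmas it appears in.
ranking = sorted(appearances, key=lambda v: wins[v] / appearances[v],
                 reverse=True)
print(ranking)  # ['Care', 'Honesty', 'Power']
```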
Related papers
- Almost AI, Almost Human: The Challenge of Detecting AI-Polished Writing [55.2480439325792]
This study systematically evaluates twelve state-of-the-art AI-text detectors using our AI-Polished-Text Evaluation dataset. Our findings reveal that detectors frequently flag even minimally polished text as AI-generated, struggle to differentiate between degrees of AI involvement, and exhibit biases against older and smaller models.
arXiv Detail & Related papers (2025-02-21T18:45:37Z)
- Statistical Scenario Modelling and Lookalike Distributions for Multi-Variate AI Risk [0.6526824510982799]
We show how scenario modelling can be used to model AI risk holistically, and how lookalike distributions from phenomena analogous to AI can be used to estimate AI impacts in the absence of directly observable data.
arXiv Detail & Related papers (2025-02-20T12:14:54Z)
- Fully Autonomous AI Agents Should Not be Developed [58.88624302082713]
This paper argues that fully autonomous AI agents should not be developed. In support of this position, we build from prior scientific literature and current product marketing to delineate different AI agent levels. Our analysis reveals that risks to people increase with the autonomy of a system.
arXiv Detail & Related papers (2025-02-04T19:00:06Z)
- Quantifying detection rates for dangerous capabilities: a theoretical model of dangerous capability evaluations [47.698233647783965]
We present a quantitative model for tracking dangerous AI capabilities over time. Our goal is to help the policy and research community visualise how dangerous capability testing can give us an early warning about approaching AI risks.
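As a toy illustration of the kind of model this describes (all parameters and functional forms below are assumptions for illustration, not the paper's actual model), one can simulate a capability level that grows over time alongside periodic evaluations whose detection probability rises with capability, then estimate how often the capability is flagged before it crosses a danger threshold:

```python
import random

def simulate(runs=10_000, horizon=20, threshold=1.0,
             growth=0.07, sensitivity=2.0, seed=0):
    """Toy model: capability grows linearly each step; an evaluation at
    each step detects it with probability sensitivity * capability
    (capped at 1). Returns the fraction of runs in which the capability
    is detected before it crosses the danger threshold."""
    rng = random.Random(seed)
    detected_in_time = 0
    for _ in range(runs):
        capability = 0.0
        for _ in range(horizon):
            capability += growth
            if capability >= threshold:
                break  # crossed the threshold undetected
            if rng.random() < min(1.0, sensitivity * capability):
                detected_in_time += 1
                break
    return detected_in_time / runs

print(f"early-warning rate: {simulate():.2%}")
```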
arXiv Detail & Related papers (2024-12-19T22:31:34Z)
- EARBench: Towards Evaluating Physical Risk Awareness for Task Planning of Foundation Model-based Embodied AI Agents [53.717918131568936]
Embodied artificial intelligence (EAI) integrates advanced AI models into physical entities for real-world interaction. Foundation models as the "brain" of EAI agents for high-level task planning have shown promising results. However, the deployment of these agents in physical environments presents significant safety challenges. This study introduces EARBench, a novel framework for automated physical risk assessment in EAI scenarios.
arXiv Detail & Related papers (2024-08-08T13:19:37Z)
- Risk thresholds for frontier AI [1.053373860696675]
One increasingly popular approach is to define capability thresholds. Risk thresholds, by contrast, simply state how much risk would be too much. Their main downside is that they are more difficult to evaluate reliably.
arXiv Detail & Related papers (2024-06-20T20:16:29Z)
- A Hormetic Approach to the Value-Loading Problem: Preventing the Paperclip Apocalypse? [0.0]
We propose HALO (Hormetic ALignment via Opponent processes), a regulatory paradigm that uses hormetic analysis to regulate the behavioral patterns of AI.
We show how HALO can solve the 'paperclip maximizer' scenario, a thought experiment where an unregulated AI tasked with making paperclips could end up converting all matter in the universe into paperclips.
Our approach may be used to help create an evolving database of 'values' based on the hedonic calculus of repeatable behaviors with decreasing marginal utility.
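As a rough numerical illustration of "decreasing marginal utility" over repeated behaviors (the utility curve, cost term, and all parameters below are my assumptions for illustration, not HALO's actual formulation), one can model the marginal benefit of each repetition as decaying while a fixed marginal cost accrues, and stop the behavior once the net marginal utility turns negative:

```python
def hormetic_limit(base_benefit=1.0, decay=0.6, marginal_cost=0.15,
                   max_repeats=100):
    """Return the repetition count at which the marginal utility
    (decaying benefit minus a fixed cost) first becomes negative."""
    for n in range(max_repeats):
        marginal_utility = base_benefit * (decay ** n) - marginal_cost
        if marginal_utility < 0:
            return n  # stop: further repetitions are net-negative
    return max_repeats

# With these assumed parameters the behavior is curtailed after a
# handful of repetitions rather than maximized without bound.
print(hormetic_limit())  # 4
```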
arXiv Detail & Related papers (2024-02-12T07:49:48Z)
- Absolutist AI [0.0]
Training AI systems with absolute constraints could make considerable progress on many AI safety problems: it provides a guardrail against the very worst outcomes of misalignment, and could prevent AIs from causing catastrophes for the sake of very valuable consequences.
arXiv Detail & Related papers (2023-07-19T03:40:37Z)
- On Adversarial Examples and Stealth Attacks in Artificial Intelligence Systems [62.997667081978825]
We present a formal framework for assessing and analyzing two classes of malevolent action towards generic Artificial Intelligence (AI) systems.
The first class involves adversarial examples and concerns the introduction of small perturbations of the input data that cause misclassification.
The second class, introduced here for the first time and named stealth attacks, involves small perturbations to the AI system itself.
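To ground the first class of attack, here is a minimal sketch (my illustration, not the paper's formal framework) of an adversarial perturbation against a toy linear classifier: a small, bounded input shift in the direction that most increases the loss, in the spirit of the fast gradient sign method, flips the predicted label.

```python
import numpy as np

# Toy linear classifier: predict class 1 if w @ x + b > 0.
w = np.array([1.0, -2.0])
b = 0.0

def predict(x):
    return int(w @ x + b > 0)

x = np.array([0.5, 0.1])           # correctly classified as class 1
assert predict(x) == 1

# FGSM-style perturbation: for a linear model, the loss gradient with
# respect to the input is proportional to w, so step against it.
epsilon = 0.2
x_adv = x - epsilon * np.sign(w)   # small, bounded change per feature

print(predict(x), predict(x_adv))  # 1 0 -- the label flips
```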
arXiv Detail & Related papers (2020-04-09T10:56:53Z)