A Hormetic Approach to the Value-Loading Problem: Preventing the
Paperclip Apocalypse?
- URL: http://arxiv.org/abs/2402.07462v2
- Date: Tue, 13 Feb 2024 05:21:40 GMT
- Title: A Hormetic Approach to the Value-Loading Problem: Preventing the
Paperclip Apocalypse?
- Authors: Nathan I. N. Henry, Mangor Pedersen, Matt Williams, Jamin L. B.
Martin, Liesje Donkin
- Abstract summary: We propose HALO (Hormetic ALignment via Opponent processes), a regulatory paradigm that uses hormetic analysis to regulate the behavioral patterns of AI.
We show how HALO can solve the 'paperclip maximizer' scenario, a thought experiment where an unregulated AI tasked with making paperclips could end up converting all matter in the universe into paperclips.
Our approach may be used to help create an evolving database of 'values' based on the hedonic calculus of repeatable behaviors with decreasing marginal utility.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The value-loading problem is a significant challenge for researchers aiming
to create artificial intelligence (AI) systems that align with human values and
preferences. This problem requires a method to define and regulate safe and
optimal limits of AI behaviors. In this work, we propose HALO (Hormetic
ALignment via Opponent processes), a regulatory paradigm that uses hormetic
analysis to regulate the behavioral patterns of AI. Behavioral hormesis is a
phenomenon where low frequencies of a behavior have beneficial effects, while
high frequencies are harmful. By modeling behaviors as allostatic opponent
processes, we can use either Behavioral Frequency Response Analysis (BFRA) or
Behavioral Count Response Analysis (BCRA) to quantify the hormetic limits of
repeatable behaviors. We demonstrate how HALO can solve the 'paperclip
maximizer' scenario, a thought experiment where an unregulated AI tasked with
making paperclips could end up converting all matter in the universe into
paperclips. Our approach may be used to help create an evolving database of
'values' based on the hedonic calculus of repeatable behaviors with decreasing
marginal utility. This positions HALO as a promising solution for the
value-loading problem, which involves embedding human-aligned values into an AI
system, and the weak-to-strong generalization problem, which explores whether
weak models can supervise stronger models as they become more intelligent.
Hence, HALO opens several research avenues that may lead to the development of
a computational value system that allows an AI algorithm to learn whether the
decisions it makes are right or wrong.
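To make the idea concrete, here is a minimal toy sketch (not the authors' implementation) of how a hormetic limit might be computed for one repeatable behavior: the behavior's benefit is modeled as a saturating a-process, its opponent b-process as a cost that grows with frequency, and the limit is the frequency at which net marginal utility turns negative. All function forms and parameter values below are illustrative assumptions.

```python
import numpy as np

# Toy opponent-process model of one repeatable behavior (illustrative only).
# a-process: primary benefit that saturates as the behavior is repeated more often.
# b-process: opponent cost that grows super-linearly with behavioral frequency.
def a_process(freq, gain=1.0, sat=5.0):
    return gain * freq / (1.0 + freq / sat)

def b_process(freq, cost=0.02, power=2.0):
    return cost * freq ** power

def net_utility(freq):
    return a_process(freq) - b_process(freq)

# Hormetic limit: the behavioral frequency beyond which performing the behavior
# more often reduces net utility (marginal utility crosses zero).
freqs = np.linspace(0.0, 20.0, 2001)
utility = net_utility(freqs)
marginal = np.gradient(utility, freqs)
limit = freqs[np.argmax(marginal < 0.0)]  # first frequency with negative marginal utility
print(f"hormetic limit ~ {limit:.2f} repetitions per unit time")
print(f"peak net utility ~ {utility.max():.2f}")
```

In the paper's framing, BFRA or BCRA would estimate such frequency- or count-response curves from behavioral data rather than from a closed-form model; the cap derived from the curve is what would keep a paperclip-making behavior inside its hormetic zone.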
Related papers
- Direct Advantage Regression: Aligning LLMs with Online AI Reward [59.78549819431632]
Online AI Feedback (OAIF) presents a promising alternative to Reinforcement Learning from Human Feedback (RLHF).
We propose Direct Advantage Regression (DAR) to optimize policy improvement through weighted supervised fine-tuning.
Our empirical results underscore that AI reward is a better form of AI supervision, consistently achieving higher human-AI agreement than AI preference.
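The summary above does not give DAR's exact objective; as a hedged sketch of the general advantage-weighted supervised fine-tuning pattern it refers to (not necessarily DAR's formulation), one can weight the per-response negative log-likelihood by a non-negative transform of an AI-reward advantage:

```python
import torch
import torch.nn.functional as F

def advantage_weighted_sft_loss(logits, target_ids, advantages, beta=1.0):
    """Generic advantage-weighted SFT loss (illustrative, not DAR's exact form).

    logits:     (batch, seq_len, vocab) model outputs on sampled responses
    target_ids: (batch, seq_len) token ids of those responses
    advantages: (batch,) AI-assigned reward advantage of each response vs. a baseline
    """
    # Per-response negative log-likelihood, averaged over tokens.
    nll = F.cross_entropy(logits.transpose(1, 2), target_ids, reduction="none").mean(dim=1)
    # Responses with higher AI advantage receive a larger imitation weight.
    weights = torch.exp(beta * advantages).clamp(max=100.0)
    return (weights * nll).mean()
```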
arXiv Detail & Related papers (2025-04-19T04:44:32Z)
- Accelerated zero-order SGD under high-order smoothness and overparameterized regime [79.85163929026146]
We present a novel gradient-free algorithm to solve convex optimization problems.
Such problems are encountered in medicine, physics, and machine learning.
We provide convergence guarantees for the proposed algorithm under both types of noise.
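The entry does not reproduce the paper's estimator or smoothness assumptions; the sketch below only illustrates the standard two-point gradient-free update that accelerated zeroth-order methods build on (the step sizes and test function are illustrative assumptions).

```python
import numpy as np

def zero_order_sgd(f, x0, n_iters=2000, lr=0.05, mu=1e-3, seed=0):
    """Two-point zeroth-order SGD sketch: estimate descent directions from
    function values only, without access to analytic gradients."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iters):
        u = rng.standard_normal(x.shape)              # random direction
        u /= np.linalg.norm(u)
        # Directional finite difference approximates the gradient along u.
        g = (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
        x -= lr * g
    return x

# Example: minimize a simple convex quadratic using only function evaluations.
f = lambda x: np.sum((x - 3.0) ** 2)
print(zero_order_sgd(f, x0=np.zeros(5)))              # approaches [3, 3, 3, 3, 3]
```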
arXiv Detail & Related papers (2024-11-21T10:26:17Z)
- Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing [63.20133320524577]
Large Language Models (LLMs) have demonstrated great potential as generalist assistants.
It is crucial that these models exhibit desirable behavioral traits, such as non-toxicity and resilience against jailbreak attempts.
In this paper, we observe that directly editing a small subset of parameters can effectively modulate specific behaviors of LLMs.
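The summary does not detail the editing procedure; as a generic, hedged sketch of editing a small parameter subset, one might shift selected weights along a precomputed "behavior direction" (how that direction is obtained, e.g. by contrasting gradients on toxic vs. non-toxic prompts, is an assumption here, not the paper's stated method):

```python
import torch

def edit_parameters(model, behavior_direction, target_layers, alpha=0.1):
    """Generic parameter-editing sketch (not the paper's exact method).

    behavior_direction: dict mapping a parameter name to a tensor of the same
    shape, pointing toward the unwanted behavior (assumed to be precomputed).
    target_layers: names of the small subset of parameters to edit.
    """
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in target_layers and name in behavior_direction:
                # Nudge only the selected layers away from the behavior direction.
                param.add_(-alpha * behavior_direction[name])
    return model
```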
arXiv Detail & Related papers (2024-07-11T17:52:03Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
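REBEL's regularizer is not specified in this summary; the sketch below shows one common, generic way to regularize a learned reward against overoptimization, offered as an assumption rather than the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def regularized_preference_loss(r_chosen, r_rejected, lam=0.01):
    """Bradley-Terry preference loss plus a simple magnitude regularizer
    (a generic sketch of reward regularization, not REBEL's exact form).

    r_chosen, r_rejected: (batch,) learned rewards for the preferred and
    dispreferred trajectories from human feedback.
    """
    # Fit human preferences: the preferred trajectory should score higher.
    pref_loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Penalize extreme reward values so the policy cannot exploit poorly
    # supported regions of the reward model (reward overoptimization).
    reg = lam * (r_chosen.pow(2).mean() + r_rejected.pow(2).mean())
    return pref_loss + reg
```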
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training.
For the first time, our theory can reliably predict, at an early stage of training, whether training will diverge.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
- Actions Speak What You Want: Provably Sample-Efficient Reinforcement Learning of the Quantal Stackelberg Equilibrium from Strategic Feedbacks [94.07688076435818]
We study reinforcement learning for learning a Quantal Stackelberg Equilibrium (QSE) in an episodic Markov game with a leader-follower structure.
Our algorithms are based on (i) learning the quantal response model via maximum likelihood estimation and (ii) model-free or model-based RL for solving the leader's decision making problem.
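As a hedged illustration of step (i) only: under a quantal (logit) response model the follower chooses action a with probability proportional to exp(eta * u(a)), so the temperature eta can be fit by maximum likelihood from observed follower choices. Directly observed utilities and the synthetic data below are simplifying assumptions, not the paper's setting.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_quantal_response(utilities, actions):
    """MLE of the quantal-response temperature eta (toy sketch).

    utilities: (n_obs, n_actions) follower's utility for each action
    actions:   (n_obs,) index of the action the follower actually chose
    """
    def neg_log_lik(eta):
        logits = eta * utilities
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(actions)), actions].sum()

    res = minimize_scalar(neg_log_lik, bounds=(1e-3, 50.0), method="bounded")
    return res.x

# Synthetic check: a follower responding quantally with eta = 2 to random utilities.
rng = np.random.default_rng(0)
U = rng.normal(size=(5000, 4))
p = np.exp(2.0 * U); p /= p.sum(axis=1, keepdims=True)
a = np.array([rng.choice(4, p=row) for row in p])
print(fit_quantal_response(U, a))   # should recover a value close to 2.0
```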
arXiv Detail & Related papers (2023-07-26T10:24:17Z)
- On-Robot Bayesian Reinforcement Learning for POMDPs [16.667924736270415]
This paper advances Bayesian reinforcement learning for robotics by proposing a specialized framework for physical systems.
We capture prior knowledge about the physical system in a factored representation, demonstrate that the posterior factorizes in a similar shape, and ultimately formalize the model in a Bayesian framework.
We then introduce a sample-based online solution method, based on Monte-Carlo tree search and particle filtering, specialized to solve the resulting model.
arXiv Detail & Related papers (2023-07-22T01:16:29Z)
- Absolutist AI [0.0]
Training AI systems with absolute constraints may yield considerable progress on many AI safety problems.
It provides a guardrail for avoiding the very worst outcomes of misalignment.
It could prevent AIs from causing catastrophes for the sake of very valuable consequences.
arXiv Detail & Related papers (2023-07-19T03:40:37Z)
- Fairness in AI and Its Long-Term Implications on Society [68.8204255655161]
We take a closer look at AI fairness and analyze how a lack of AI fairness can lead to the deepening of biases over time.
We discuss how biased models can lead to more negative real-world outcomes for certain groups.
If these issues persist, they could be reinforced by interactions with other risks and have severe implications for society in the form of social unrest.
arXiv Detail & Related papers (2023-04-16T11:22:59Z)
- Constitutional AI: Harmlessness from AI Feedback [19.964791766072132]
We experiment with methods for training a harmless AI assistant through self-improvement.
The only human oversight is provided through a list of rules or principles.
We are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them.
arXiv Detail & Related papers (2022-12-15T06:19:23Z)
- Learning of Parameters in Behavior Trees for Movement Skills [0.9562145896371784]
Behavior Trees (BTs) can provide a policy representation that supports modular and composable skills.
We present a novel algorithm that can learn the parameters of a BT policy in simulation and then generalize to the physical robot without any additional training.
arXiv Detail & Related papers (2021-09-27T13:46:39Z)