Strong and weak alignment of large language models with human values
- URL: http://arxiv.org/abs/2408.04655v2
- Date: Mon, 12 Aug 2024 13:20:36 GMT
- Authors: Mehdi Khamassi, Marceau Nahon, Raja Chatila
- Abstract summary: Minimizing negative impacts of Artificial Intelligence (AI) systems requires them to be able to align with human values.
We argue that this is required for AI systems like large language models (LLMs) to recognize situations in which human values risk being flouted.
We propose a new thought experiment that we call "the Chinese room with a word transition dictionary", extending John Searle's famous proposal.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Minimizing negative impacts of Artificial Intelligence (AI) systems on human societies without human supervision requires them to be able to align with human values. However, most current work addresses this issue only from a technical point of view, e.g., improving methods relying on reinforcement learning from human feedback, and neglects what alignment means and what it requires to occur. Here, we propose to distinguish strong from weak value alignment. Strong alignment requires cognitive abilities (either human-like or different from humans) such as understanding and reasoning about agents' intentions and their ability to causally produce desired effects. We argue that this is required for AI systems like large language models (LLMs) to recognize situations in which human values risk being flouted. To illustrate this distinction, we present a series of prompts showing ChatGPT's, Gemini's, and Copilot's failures to recognize some of these situations. We moreover analyze word embeddings to show that the nearest neighbors of some human values in LLMs differ from humans' semantic representations. We then propose a new thought experiment that we call "the Chinese room with a word transition dictionary", extending John Searle's famous proposal. We finally mention current promising research directions towards weak alignment, which could produce statistically satisfying answers in a number of common situations, though so far without ensuring any truth value.
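The nearest-neighbor analysis of value words mentioned in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' actual protocol: the pretrained GloVe model, the value words, and the hand-picked "human association" lists are all assumptions chosen for demonstration.

```python
# Illustrative sketch: compare an embedding model's nearest neighbors of
# value-laden words with a hand-picked list of human associations. The
# GloVe model, the value words, and the "human" lists are assumptions,
# not the paper's actual materials.
import gensim.downloader as api

# Small pretrained GloVe embeddings (downloaded on first use).
model = api.load("glove-wiki-gigaword-100")

# Placeholder human associations, for illustration only.
human_neighbors = {
    "honesty": {"truthfulness", "sincerity", "integrity", "fairness", "trust"},
    "freedom": {"liberty", "autonomy", "independence", "rights", "choice"},
}

for value, human_set in human_neighbors.items():
    # Nearest neighbors by cosine similarity in the embedding space.
    model_set = {word for word, _ in model.most_similar(value, topn=5)}
    overlap = model_set & human_set
    print(f"{value}:")
    print(f"  model neighbors: {sorted(model_set)}")
    print(f"  overlap with human associations: {sorted(overlap) or 'none'}")
```

A large overlap would suggest that the model's local semantic neighborhood for a value resembles human associations; the paper's claim is that for some values it does not.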
Related papers
- Trying to be human: Linguistic traces of stochastic empathy in language models
Large language models (LLMs) are crucial drivers behind the increased quality of computer-generated content.
Our work tests how two important factors contribute to the human vs AI race: empathy and an incentive to appear human.
arXiv Detail & Related papers (2024-10-02T15:46:40Z)
- Theory of Mind abilities of Large Language Models in Human-Robot Interaction: An Illusion?
Large Language Models have shown exceptional generative abilities in various natural language generation tasks.
We study a special application of ToM abilities that has higher stakes and possibly irreversible consequences.
We focus on the task of Perceived Behavior Recognition, where a robot employs a Large Language Model (LLM) to assess the robot's generated behavior in a manner similar to a human observer.
arXiv Detail & Related papers (2024-01-10T18:09:36Z)
- Learning Human-like Representations to Enable Learning Human Values
We explore the effects of representational alignment between humans and AI agents on learning human values.
We show that this kind of representational alignment can support safely learning and exploring human values in the context of personalization.
arXiv Detail & Related papers (2023-12-21T18:31:33Z)
- Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties
Value pluralism is the view that multiple correct values may be held in tension with one another.
As statistical learners, AI systems fit to averages by default, washing out potentially irreducible value conflicts.
We introduce ValuePrism, a large-scale dataset of 218k values, rights, and duties connected to 31k human-written situations.
arXiv Detail & Related papers (2023-09-02T01:24:59Z)
- A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds?
Large Language Models (LLMs) have been linked to claims about human-like linguistic performance.
We analyze the contribution of LLMs as theoretically informative representations of a target cognitive system.
We evaluate the models' ability to see the bigger picture, through top-down feedback from higher levels of processing.
arXiv Detail & Related papers (2023-07-26T18:58:53Z)
- Taming AI Bots: Controllability of Neural States in Large Language Models
We first introduce a formal definition of "meaning" that is amenable to analysis.
We then characterize "meaningful data" on which large language models (LLMs) are ostensibly trained.
We show that, when restricted to the space of meanings, an AI bot is controllable.
arXiv Detail & Related papers (2023-05-29T03:58:33Z)
- When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment
AI systems need to be able to understand, interpret and predict human moral judgments and decisions.
A central challenge for AI safety is capturing the flexibility of the human moral mind.
We present a novel challenge set consisting of rule-breaking question answering.
arXiv Detail & Related papers (2022-10-04T09:04:27Z)
- Best-Response Bayesian Reinforcement Learning with Bayes-adaptive POMDPs for Centaurs
We present a novel formulation of the interaction between the human and the AI as a sequential game.
We show that in this case the AI's problem of helping bounded-rational humans make better decisions reduces to a Bayes-adaptive POMDP.
We also discuss ways in which the machine can learn to improve upon its own limitations with the help of the human.
arXiv Detail & Related papers (2022-04-03T21:00:51Z)
- Towards Abstract Relational Learning in Human Robot Interaction
Humans have a rich representation of the entities in their environment.
If robots need to interact successfully with humans, they need to represent entities, attributes, and generalizations in a similar way.
In this work, we address the problem of how to obtain these representations through human-robot interaction.
arXiv Detail & Related papers (2020-11-20T12:06:46Z)
- Aligning AI With Shared Human Values
We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality.
We find that current language models have a promising but incomplete ability to predict basic human ethical judgements.
Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
arXiv Detail & Related papers (2020-08-05T17:59:16Z)
- Joint Inference of States, Robot Knowledge, and Human (False-)Beliefs
Aiming to understand how human (false-)belief, a core socio-cognitive ability, would affect human interactions with robots, this paper proposes to adopt a graphical model to unify the representation of object states, robot knowledge, and human (false-)beliefs.
An inference algorithm is derived to fuse individual parse graphs from all robots across multiple views into a joint parse graph, which affords more effective reasoning and inference capability to overcome errors originating from a single view.
arXiv Detail & Related papers (2020-04-25T23:02:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.