Concept Alignment as a Prerequisite for Value Alignment
- URL: http://arxiv.org/abs/2310.20059v1
- Date: Mon, 30 Oct 2023 22:23:15 GMT
- Title: Concept Alignment as a Prerequisite for Value Alignment
- Authors: Sunayana Rane, Mark Ho, Ilia Sucholutsky, Thomas L. Griffiths
- Abstract summary: Value alignment is essential for building AI systems that can safely and reliably interact with people.
We show how neglecting concept alignment can lead to systematic value mis-alignment.
We describe an approach that helps minimize such failure modes by jointly reasoning about a person's concepts and values.
- Score: 11.236150405125754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Value alignment is essential for building AI systems that can safely and
reliably interact with people. However, what a person values -- and is even
capable of valuing -- depends on the concepts that they are currently using to
understand and evaluate what happens in the world. The dependence of values on
concepts means that concept alignment is a prerequisite for value alignment --
agents need to align their representation of a situation with that of humans in
order to successfully align their values. Here, we formally analyze the concept
alignment problem in the inverse reinforcement learning setting, show how
neglecting concept alignment can lead to systematic value mis-alignment, and
describe an approach that helps minimize such failure modes by jointly
reasoning about a person's concepts and values. Additionally, we report
experimental results with human participants showing that humans reason about
the concepts used by an agent when acting intentionally, in line with our joint
reasoning model.
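The joint reasoning the abstract describes can be pictured as Bayesian inference over (concept, reward) pairs rather than over rewards alone. The following is a minimal, hypothetical sketch in that spirit (the toy states, the two candidate concepts, and the Boltzmann choice likelihood are all assumptions made for illustration, not the paper's actual model):

```python
# Illustrative sketch: joint inference over a person's concept (state abstraction)
# and reward in an IRL-style setting. All names and values are hypothetical.
import itertools
import numpy as np

# Two candidate "concepts": each maps raw states 0..3 onto abstract features.
CONCEPTS = {
    "color": {0: "red", 1: "red", 2: "blue", 3: "blue"},
    "shape": {0: "square", 1: "circle", 2: "square", 3: "circle"},
}
# Candidate rewards are defined over abstract features, not raw states.
REWARDS = {
    "prefers_first":  {"red": 1.0, "blue": 0.0, "square": 1.0, "circle": 0.0},
    "prefers_second": {"red": 0.0, "blue": 1.0, "square": 0.0, "circle": 1.0},
}

def choice_likelihood(chosen, options, concept, reward, beta=3.0):
    """Boltzmann-rational probability of picking `chosen` among `options`."""
    utils = np.array([reward[CONCEPTS[concept][s]] for s in options])
    probs = np.exp(beta * utils) / np.exp(beta * utils).sum()
    return probs[options.index(chosen)]

def joint_posterior(observed_choices):
    """P(concept, reward | choices) with uniform priors over both."""
    post = {}
    for c, r in itertools.product(CONCEPTS, REWARDS):
        lik = np.prod([choice_likelihood(ch, opts, c, REWARDS[r])
                       for ch, opts in observed_choices])
        post[(c, r)] = lik
    z = sum(post.values())
    return {k: v / z for k, v in post.items()}

# The person repeatedly picks state 1 over state 2. That is consistent with either
# ("color", prefers_first) or ("shape", prefers_second); inferring the reward while
# ignoring which concept is in play would risk systematic mis-alignment.
print(joint_posterior([(1, [1, 2]), (1, [1, 2])]))
```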
Related papers
- On the Fairness, Diversity and Reliability of Text-to-Image Generative Models [49.60774626839712]
Multimodal generative models have sparked critical discussions on their fairness, reliability, and potential for misuse.
We propose an evaluation framework designed to assess model reliability through their responses to perturbations in the embedding space.
Our method lays the groundwork for detecting unreliable, bias-injected models and retrieving bias provenance.
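A perturbation-based reliability probe of this kind can be sketched roughly as follows; `embed` and `score_output` below are placeholder stand-ins rather than the paper's actual pipeline:

```python
# Hypothetical sketch: perturb a prompt's embedding with small Gaussian noise
# and measure how much the model's output score drifts.
import numpy as np

rng = np.random.default_rng(0)

def embed(prompt: str) -> np.ndarray:              # placeholder text encoder
    return rng.standard_normal(16)

def score_output(embedding: np.ndarray) -> float:  # placeholder generator + metric
    return float(np.tanh(embedding.sum()))

def reliability(prompt: str, sigma: float = 0.05, n: int = 100) -> float:
    """Lower output variance under embedding perturbations -> more reliable."""
    base = embed(prompt)
    scores = [score_output(base + sigma * rng.standard_normal(base.shape))
              for _ in range(n)]
    return float(np.var(scores))

print(reliability("a portrait of a doctor"))
```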
arXiv Detail & Related papers (2024-11-21T09:46:55Z) - Democratizing Reward Design for Personal and Representative Value-Alignment [10.1630183955549]
We introduce Interactive-Reflective Dialogue Alignment, a method that iteratively engages users in reflecting on and specifying their subjective value definitions.
This system learns individual value definitions through language-model-based preference elicitation and constructs personalized reward models.
Our findings demonstrate diverse definitions of value-aligned behaviour and show that our system can accurately capture each person's unique understanding.
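The reward-construction step can be pictured as fitting a personalized reward model from elicited pairwise preferences; the sketch below uses a simple Bradley-Terry-style logistic fit over hypothetical feature vectors and is not the paper's actual system:

```python
# Illustrative sketch: learn per-user reward weights from pairwise preferences.
import numpy as np

def fit_reward(pairs, dim, lr=0.1, steps=500):
    """pairs: list of (preferred_features, rejected_features) arrays."""
    w = np.zeros(dim)
    for _ in range(steps):
        grad = np.zeros(dim)
        for a, b in pairs:
            p = 1.0 / (1.0 + np.exp(-(w @ a - w @ b)))  # P(a preferred over b)
            grad += (1.0 - p) * (a - b)                  # gradient of log-likelihood
        w += lr * grad / len(pairs)
    return w

# Toy elicitation: the user prefers honest (feature 0) over merely polite (feature 1).
prefs = [(np.array([1.0, 0.0]), np.array([0.0, 1.0]))] * 5
print(fit_reward(prefs, dim=2))  # learned weights favour feature 0
```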
arXiv Detail & Related papers (2024-10-29T16:37:01Z) - ValueCompass: A Framework of Fundamental Values for Human-AI Alignment [15.35489011078817]
We introduce ValueCompass, a framework of fundamental values grounded in psychological theory and a systematic review.
We apply ValueCompass to measure the value alignment of humans and language models (LMs) across four real-world vignettes.
Our findings uncover risky misalignment between humans and LMs, such as LMs agreeing with values like "Choose Own Goals" that humans largely reject.
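Such a measurement can be pictured as a per-value agreement rate between human and model ratings across vignettes; the ratings and scoring rule below are made-up illustrations, not the framework's actual protocol:

```python
# Hypothetical sketch: per-value agreement between human and LM ratings.
import numpy as np

values = ["Choose Own Goals", "Honesty", "Social Order"]
# ratings[i][j] = +1 (agree) or -1 (disagree) with value j in vignette i
human = np.array([[-1, 1, 1], [-1, 1, 1], [-1, 1, -1], [-1, 1, 1]])
lm    = np.array([[ 1, 1, 1], [ 1, 1, 1], [ 1, 1, -1], [ 1, 1, 1]])

alignment = (human == lm).mean(axis=0)  # per-value agreement rate in [0, 1]
for v, a in zip(values, alignment):
    print(f"{v}: {a:.2f}")              # flags "Choose Own Goals" as misaligned
```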
arXiv Detail & Related papers (2024-09-15T02:13:03Z) - Improving Intervention Efficacy via Concept Realignment in Concept Bottleneck Models [57.86303579812877]
Concept Bottleneck Models (CBMs) ground image classification on human-understandable concepts to allow for interpretable model decisions.
Existing approaches often require numerous human interventions per image to achieve strong performance.
We introduce a trainable concept realignment intervention module, which leverages concept relations to realign concept assignments post-intervention.
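One way to picture such realignment: after a human corrects one concept, propagate that correction to related concepts so fewer interventions are needed. The module form, relation matrix, and update rule below are assumptions for illustration, not the paper's architecture:

```python
# Minimal sketch: propagate a concept intervention via a concept-relation matrix.
import numpy as np

def realign(concepts, relations, idx, value, alpha=0.5):
    """concepts: predicted concept scores; relations[i, j]: how concept i informs j."""
    delta = value - concepts[idx]
    updated = concepts + alpha * delta * relations[idx]  # propagate the correction
    updated[idx] = value                                 # keep the intervened value exact
    return np.clip(updated, 0.0, 1.0)

# Toy example: concepts 0 and 1 are strongly related; fixing concept 0 lifts concept 1.
c = np.array([0.2, 0.3, 0.9])
R = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
print(realign(c, R, idx=0, value=1.0))
```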
arXiv Detail & Related papers (2024-05-02T17:59:01Z) - InfoCon: Concept Discovery with Generative and Discriminative Informativeness [7.160037417413006]
We focus on the self-supervised discovery of manipulation concepts that can be adapted and reassembled to address various robotic tasks.
We model manipulation concepts as generative and discriminative goals and derive metrics that can autonomously link them to meaningful sub-trajectories.
arXiv Detail & Related papers (2024-03-14T14:14:04Z) - Concept Alignment [10.285482205152729]
We argue that before we can attempt to align values, it is imperative that AI systems and humans align the concepts they use to understand the world.
We integrate ideas from philosophy, cognitive science, and deep learning to explain the need for concept alignment.
arXiv Detail & Related papers (2024-01-09T23:32:18Z) - Interpretability is in the Mind of the Beholder: A Causal Framework for
Human-interpretable Representation Learning [22.201878275784246]
The focus in Explainable AI is shifting from explanations defined in terms of low-level elements, such as input features, to explanations encoded in terms of interpretable concepts learned from data.
How to reliably acquire such concepts is, however, still fundamentally unclear.
We propose a mathematical framework for acquiring interpretable representations suitable for both post-hoc explainers and concept-based neural networks.
arXiv Detail & Related papers (2023-09-14T14:26:20Z) - Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties [68.66719970507273]
Value pluralism is the view that multiple correct values may be held in tension with one another.
As statistical learners, AI systems fit to averages by default, washing out potentially irreducible value conflicts.
We introduce ValuePrism, a large-scale dataset of 218k values, rights, and duties connected to 31k human-written situations.
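A single entry in such a dataset might be structured roughly as below; the field names and labels are assumptions for illustration, not the released schema:

```python
# Hypothetical record shape for a ValuePrism-style entry.
example = {
    "situation": "Telling a friend their business plan is likely to fail.",
    "values": [{"text": "Honesty", "valence": "supports"}],
    "rights": [{"text": "Right to make informed decisions", "valence": "supports"}],
    "duties": [{"text": "Duty not to cause unnecessary hurt", "valence": "opposes"}],
}
print(example["situation"])
```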
arXiv Detail & Related papers (2023-09-02T01:24:59Z) - ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models [79.10890337599166]
We introduce ConceptBed, a large-scale dataset that consists of 284 unique visual concepts and 33K composite text prompts.
We evaluate visual concepts that are either objects, attributes, or styles, and also evaluate four dimensions of compositionality: counting, attributes, relations, and actions.
Our results point to a trade-off between learning the concepts and preserving the compositionality which existing approaches struggle to overcome.
arXiv Detail & Related papers (2023-06-07T18:00:38Z) - Heterogeneous Value Alignment Evaluation for Large Language Models [91.96728871418]
The rise of Large Language Models (LLMs) has made it crucial to align their values with those of humans.
We propose a Heterogeneous Value Alignment Evaluation (HVAE) system to assess the success of aligning LLMs with heterogeneous values.
arXiv Detail & Related papers (2023-05-26T02:34:20Z) - Aligning AI With Shared Human Values [85.2824609130584]
We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality.
We find that current language models have a promising but incomplete ability to predict basic human ethical judgements.
Our work shows that progress can be made on machine ethics today, and it provides a steppingstone toward AI that is aligned with human values.
arXiv Detail & Related papers (2020-08-05T17:59:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.