Democratizing Reward Design for Personal and Representative Value-Alignment
- URL: http://arxiv.org/abs/2410.22203v1
- Date: Tue, 29 Oct 2024 16:37:01 GMT
- Title: Democratizing Reward Design for Personal and Representative Value-Alignment
- Authors: Carter Blair, Kate Larson, Edith Law
- Abstract summary: We introduce Interactive-Reflective Dialogue Alignment, a method that iteratively engages users in reflecting on and specifying their subjective value definitions.
This system learns individual value definitions through language-model-based preference elicitation and constructs personalized reward models.
Our findings demonstrate diverse definitions of value-aligned behaviour and show that our system can accurately capture each person's unique understanding.
- Abstract: Aligning AI agents with human values is challenging due to diverse and subjective notions of values. Standard alignment methods often aggregate crowd feedback, which can result in the suppression of unique or minority preferences. We introduce Interactive-Reflective Dialogue Alignment, a method that iteratively engages users in reflecting on and specifying their subjective value definitions. This system learns individual value definitions through language-model-based preference elicitation and constructs personalized reward models that can be used to align AI behaviour. We evaluated our system through two studies with 30 participants, one focusing on "respect" and the other on ethical decision-making in autonomous vehicles. Our findings demonstrate diverse definitions of value-aligned behaviour and show that our system can accurately capture each person's unique understanding. This approach enables personalized alignment and can inform more representative and interpretable collective alignment strategies.
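To make the pipeline concrete, the sketch below illustrates the second step, fitting a personalized reward model from one user's pairwise preferences, using a simple Bradley-Terry style linear model. The embed and ask_user_preference helpers are hypothetical stand-ins; the paper's elicitation is carried out through language-model-driven reflective dialogue, which this sketch abstracts away.

```python
# Minimal sketch (not the paper's implementation) of learning a personalized
# reward model from one user's pairwise preferences over behaviours.
# `embed` and `ask_user_preference` are hypothetical helpers assumed here.
import numpy as np

def fit_personal_reward_model(behaviour_pairs, ask_user_preference, embed,
                              lr=0.1, epochs=200):
    """Learn a linear reward r(x) = w . embed(x) from one user's pairwise choices."""
    dim = embed(behaviour_pairs[0][0]).shape[0]
    w = np.zeros(dim)
    # 1. Preference elicitation: the user indicates which behaviour better
    #    matches their personal definition of the value (1 if `a` is preferred).
    labels = [ask_user_preference(a, b) for a, b in behaviour_pairs]
    # 2. Fit a Bradley-Terry style model: P(a preferred over b) = sigmoid(r(a) - r(b)).
    for _ in range(epochs):
        for (a, b), y in zip(behaviour_pairs, labels):
            diff = embed(a) - embed(b)
            p = 1.0 / (1.0 + np.exp(-w @ diff))
            w += lr * (y - p) * diff  # gradient ascent on the log-likelihood
    return lambda x: float(w @ embed(x))  # personalized reward model
```

The resulting reward function can then be used to score or rank candidate agent behaviours for that individual user.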
Related papers
- MAP: Multi-Human-Value Alignment Palette [22.74688073469946]
We develop a novel, first-principles approach called Multi-Human-Value Alignment Palette (MAP).
MAP navigates the alignment across multiple human values in a structured and reliable way.
We conduct a detailed theoretical analysis of MAP, quantifying the trade-offs between values, the sensitivity to constraints, and the fundamental connection between multi-value alignment and sequential alignment, and proving that linear weighted rewards are sufficient for multi-value alignment.
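In symbols, that sufficiency claim says that alignment across m value dimensions can be driven by a single scalar reward that linearly combines per-value rewards (illustrative notation, not necessarily MAP's own):

```latex
% Illustrative linear weighted multi-value reward; notation assumed, not MAP's own.
R(x) = \sum_{i=1}^{m} w_i \, r_i(x), \qquad w_i \ge 0
```

where r_i(x) scores a behaviour x against the i-th human value and the weights w_i encode the chosen trade-offs between values.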
arXiv Detail & Related papers (2024-10-24T23:16:39Z)
- Can Language Models Reason about Individualistic Human Values and Preferences? [44.249817353449146]
We study language models (LMs) on the specific challenge of individualistic value reasoning.
We reveal critical limitations in frontier LMs' ability to reason about individualistic human values, with accuracies between 55% and 65%.
We also identify a partiality of LMs in reasoning about global individualistic values, as measured by our proposed Value Inequity Index (σINEQUITY).
arXiv Detail & Related papers (2024-10-04T19:03:41Z)
- Aligning Large Language Models from Self-Reference AI Feedback with one General Principle [61.105703857868775]
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback.
Specifically, the AI first responds to the user's instructions and then generates criticism of other answers, using its own response as a reference.
Finally, we determine which answer better fits human preferences according to the criticism.
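A rough sketch of that three-step loop is given below; llm is a hypothetical text-generation callable standing in for the 13B Llama2-Chat model, not the paper's actual interface.

```python
# Illustrative sketch of self-reference AI feedback; `llm` is a hypothetical
# text-generation callable, not the framework's real API.
def self_reference_feedback(llm, instruction, candidate_answers):
    # Step 1: the model first answers the instruction itself.
    own_answer = llm(f"Instruction: {instruction}\nAnswer:")
    judgements = []
    for answer in candidate_answers:
        # Step 2: criticise each candidate, using the model's own answer as a reference.
        critique = llm(
            f"Instruction: {instruction}\n"
            f"Reference answer: {own_answer}\n"
            f"Candidate answer: {answer}\n"
            "Critique the candidate relative to the reference:"
        )
        judgements.append((answer, critique))
    # Step 3: pick the candidate whose critique suggests the best fit with human preferences.
    return llm(
        "Based on these critiques, which candidate answer best satisfies the instruction?\n"
        + "\n".join(f"Candidate: {a}\nCritique: {c}" for a, c in judgements)
    )
```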
arXiv Detail & Related papers (2024-06-17T03:51:46Z)
- POV Learning: Individual Alignment of Multimodal Models using Human Perception [1.4796543791607086]
We argue that alignment on an individual level can boost the subjective predictive performance for the individual user interacting with the system.
We test this by integrating perception information into machine learning systems and measuring their predictive performance.
Our findings suggest that exploiting individual perception signals for the machine learning of subjective human assessments provides a valuable cue for individual alignment.
arXiv Detail & Related papers (2024-05-07T16:07:29Z)
- Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [103.12563033438715]
Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values.
Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives.
We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
arXiv Detail & Related papers (2024-02-29T12:12:30Z)
- Concept Alignment as a Prerequisite for Value Alignment [11.236150405125754]
Value alignment is essential for building AI systems that can safely and reliably interact with people.
We show how neglecting concept alignment can lead to systematic value mis-alignment.
We describe an approach that helps minimize such failure modes by jointly reasoning about a person's concepts and values.
arXiv Detail & Related papers (2023-10-30T22:23:15Z)
- Evaluating the Fairness of Discriminative Foundation Models in Computer Vision [51.176061115977774]
We propose a novel taxonomy for bias evaluation of discriminative foundation models, such as Contrastive Language-Image Pretraining (CLIP).
We then systematically evaluate existing methods for mitigating bias in these models with respect to our taxonomy.
Specifically, we evaluate OpenAI's CLIP and OpenCLIP models for key applications, such as zero-shot classification, image retrieval and image captioning.
arXiv Detail & Related papers (2023-10-18T10:32:39Z)
- Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties [68.66719970507273]
Value pluralism is the view that multiple correct values may be held in tension with one another.
As statistical learners, AI systems fit to averages by default, washing out potentially irreducible value conflicts.
We introduce ValuePrism, a large-scale dataset of 218k values, rights, and duties connected to 31k human-written situations.
arXiv Detail & Related papers (2023-09-02T01:24:59Z)
- Heterogeneous Value Alignment Evaluation for Large Language Models [91.96728871418]
The emergence of Large Language Models (LLMs) has made it crucial to align their values with those of humans.
We propose a Heterogeneous Value Alignment Evaluation (HVAE) system to assess the success of aligning LLMs with heterogeneous values.
arXiv Detail & Related papers (2023-05-26T02:34:20Z)
- Fully Unsupervised Person Re-identification via Selective Contrastive Learning [58.5284246878277]
Person re-identification (ReID) aims to retrieve images of the same person across images captured by different cameras.
We propose a novel selective contrastive learning framework for unsupervised feature learning.
Experimental results demonstrate the superiority of our method over state-of-the-art approaches in unsupervised person ReID.
arXiv Detail & Related papers (2020-10-15T09:09:23Z)