From Instructions to Intrinsic Human Values -- A Survey of Alignment Goals for Big Models
- URL: http://arxiv.org/abs/2308.12014v2
- Date: Mon, 4 Sep 2023 03:32:05 GMT
- Title: From Instructions to Intrinsic Human Values -- A Survey of Alignment Goals for Big Models
- Authors: Jing Yao, Xiaoyuan Yi, Xiting Wang, Jindong Wang and Xing Xie
- Abstract summary: We conduct a survey of different alignment goals in existing work and trace their evolution paths to help identify the most essential goal. Our analysis reveals a goal transformation from fundamental abilities to value orientation, indicating the potential of intrinsic human values as the alignment goal for enhanced LLMs.
- Score: 48.326660953180145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Big models, exemplified by Large Language Models (LLMs), are typically pre-trained on massive data and contain an enormous number of parameters. They not only achieve significantly improved performance across diverse tasks but also exhibit emergent capabilities absent in smaller models. However, the growing intertwining of big models with everyday human life poses potential risks and might cause serious social harm. Therefore, many efforts have been made to align LLMs with humans so that they better follow user instructions and satisfy human preferences. Nevertheless, 'what to align with' has not been fully discussed, and inappropriate alignment goals might even backfire. In this paper, we conduct a comprehensive survey of different alignment goals in existing work and trace their evolution paths to help identify the most essential goal. In particular, we investigate related works from two perspectives: the definition of alignment goals and alignment evaluation. Our analysis encompasses three distinct levels of alignment goals and reveals a goal transformation from fundamental abilities to value orientation, indicating the potential of intrinsic human values as the alignment goal for enhanced LLMs. Based on these results, we further discuss the challenges of achieving such intrinsic value alignment and provide a collection of available resources for future research on the alignment of big models.
Related papers
- Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors [74.04775677110179]
In-context Learning (ICL) has become the primary method for performing natural language tasks with Large Language Models (LLMs).
In this work, we examine whether the collapse of LLM posteriors in subjective tasks is the result of the aggregation used in the corresponding datasets, where combining low-agreement, disparate annotations might produce annotation artifacts that introduce detrimental noise into the prompt.
Our results indicate that aggregation is a confounding factor in the modeling of subjective tasks, and advocate focusing on modeling individuals instead.
Date: 2024-10-17T17:16:00Z
- Uncovering Factor Level Preferences to Improve Human-Model Alignment [58.50191593880829]
We introduce PROFILE, a framework that uncovers and quantifies the influence of specific factors driving preferences.
PROFILE's factor-level analysis explains the 'why' behind human-model alignment and misalignment.
We demonstrate how leveraging factor-level insights, including addressing misaligned factors, can improve alignment with human preferences.
Date: 2024-10-09T15:02:34Z
- On the Modeling Capabilities of Large Language Models for Sequential Decision Making [52.128546842746246]
Large pretrained models are showing increasingly better performance in reasoning and planning tasks.
We evaluate their ability to produce decision-making policies, either directly, by generating actions, or indirectly, by first generating reward models used to train an agent with reinforcement learning.
In environments with unfamiliar dynamics, we explore how fine-tuning LLMs with synthetic data can significantly improve their reward modeling capabilities.
Date: 2024-10-08T03:12:57Z
- Towards Measuring Goal-Directedness in AI Systems [0.0]
A key prerequisite for AI systems pursuing unintended goals is that they behave in a coherent and goal-directed manner.
We propose a new family of definitions of the goal-directedness of a policy that analyze whether it is well-modeled as near-optimal for many reward functions.
Our contribution is a simpler, more easily computable definition of goal-directedness, intended to help address the question of whether AI systems could pursue dangerous goals.
Date: 2024-10-07T01:34:42Z
- Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models [46.09562860220433]
We introduce GazeReward, a novel framework that integrates implicit feedback, specifically eye-tracking (ET) data, into the Reward Model (RM).
Our approach significantly improves the accuracy of the RM on established human preference datasets.
Date: 2024-10-02T13:24:56Z
- Benchmarking General-Purpose In-Context Learning [19.40952728849431]
In-context learning (ICL) empowers generative models to address new tasks effectively and efficiently on the fly.
In this paper, we study extending ICL to address a broader range of tasks with an extended learning horizon and higher improvement potential.
We introduce two benchmarks specifically crafted to train and evaluate General-Purpose In-Context Learning (GPICL) functionalities.
Date: 2024-05-27T14:50:42Z
- Assessment of Multimodal Large Language Models in Alignment with Human Values [43.023052912326314]
We introduce Ch3Ef, a Compreh3ensive Evaluation dataset and strategy for assessing alignment with human expectations.
The Ch3Ef dataset contains 1002 human-annotated data samples, covering 12 domains and 46 tasks based on the HHH (helpful, honest, harmless) principle.
Date: 2024-03-26T16:10:21Z
- Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [103.12563033438715]
Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values.
Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives.
We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
Date: 2024-02-29T12:12:30Z
- Goal-Aware Prediction: Learning to Model What Matters [105.43098326577434]
One of the fundamental challenges in using a learned forward dynamics model is the mismatch between the objective of the learned model and that of the downstream planner or policy.
We propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space.
We find that our method more effectively models the relevant parts of the scene conditioned on the goal, and as a result outperforms standard task-agnostic dynamics models and model-free reinforcement learning.
Date: 2020-07-14T16:42:59Z