Unintended Impacts of LLM Alignment on Global Representation
- URL: http://arxiv.org/abs/2402.15018v2
- Date: Thu, 6 Jun 2024 22:31:48 GMT
- Title: Unintended Impacts of LLM Alignment on Global Representation
- Authors: Michael J. Ryan, William Held, Diyi Yang
- Abstract summary: Developers align Large Language Models (LLMs) to user preferences through procedures such as Reinforcement Learning From Human Feedback (RLHF) and Direct Preference Optimization (DPO).
We explore how alignment impacts performance along three axes of global representation: English dialects, multilingualism, and opinions from and about countries worldwide.
We conclude by discussing design decisions that led to these unintended impacts and recommendations for more equitable preference tuning.
- Score: 62.6579934112071
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Before being deployed for user-facing applications, developers align Large Language Models (LLMs) to user preferences through a variety of procedures, such as Reinforcement Learning From Human Feedback (RLHF) and Direct Preference Optimization (DPO). Current evaluations of these procedures focus on benchmarks of instruction following, reasoning, and truthfulness. However, human preferences are not universal, and aligning to specific preference sets may have unintended effects. We explore how alignment impacts performance along three axes of global representation: English dialects, multilingualism, and opinions from and about countries worldwide. Our results show that current alignment procedures create disparities between English dialects and global opinions. We find alignment improves capabilities in several languages. We conclude by discussing design decisions that led to these unintended impacts and recommendations for more equitable preference tuning. We make our code and data publicly available on GitHub.
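For readers unfamiliar with the preference-tuning procedures named in the abstract, the sketch below shows the standard DPO objective (Rafailov et al., 2023) that such tuning typically minimizes. It is an illustrative sketch, not code from the paper's repository: the function name and the default β are assumptions, and it presumes per-response log-probabilities have already been computed for the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over a batch of preference pairs.

    Inputs are per-example sequence log-probabilities log pi(y|x),
    summed over response tokens, for the policy and a frozen reference model.
    """
    # Implicit reward: log-ratio of policy to reference model.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Reward the preferred response's margin over the dispreferred one.
    margin = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(margin).mean()
```

Whose preference pairs populate such a dataset is precisely the design decision the paper argues shapes dialectal, multilingual, and opinion representation.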
Related papers
- VLP: Vision-Language Preference Learning for Embodied Manipulation [29.7387976970634]
We propose a vision-language preference model to provide preference feedback for embodied manipulation tasks.
The preference model learns to extract language-related features, and then serves as a preference annotator in various downstream tasks.
Our method provides accurate preferences and generalizes to unseen tasks and unseen language instructions, outperforming the baselines by a large margin.
arXiv Detail & Related papers (2025-02-17T15:32:14Z) - Northeastern Uni at Multilingual Counterspeech Generation: Enhancing Counter Speech Generation with LLM Alignment through Direct Preference Optimization [1.1368382184602488]
The automatic generation of counter-speech (CS) is a critical strategy for addressing hate speech by providing constructive and informed responses.
Existing methods often fail to generate high-quality, impactful, and scalable CS, particularly across diverse linguistic contexts.
We propose a novel methodology to enhance CS generation by aligning Large Language Models (LLMs) using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
arXiv Detail & Related papers (2024-12-19T23:22:11Z) - ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Contrastive Framework [78.07201802874529]
ShifCon is a Shift-based Contrastive framework that aligns the internal forward process of other languages toward that of the dominant one.
It shifts the representations of non-dominant languages into the dominant language subspace, allowing them to access relatively rich information encoded in the model parameters.
Experiments demonstrate that our ShifCon framework significantly enhances the performance of non-dominant languages.
arXiv Detail & Related papers (2024-10-25T10:28:59Z) - MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time [50.41806216615488]
Large Language Models (LLMs) acquire extensive knowledge and remarkable abilities from large-scale text corpora.
To make LLMs more usable, aligning them with human preferences is essential.
We propose an effective method, MetaAlign, which aims to help LLMs dynamically align with various explicit or implicit preferences specified at inference time.
arXiv Detail & Related papers (2024-10-18T05:31:13Z) - Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [103.12563033438715]
Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values.
Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives.
We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
arXiv Detail & Related papers (2024-02-29T12:12:30Z) - Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback [70.32795295142648]
Linear alignment is a novel algorithm that aligns language models with human preferences in a single inference step.
Experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of LLM alignment.
arXiv Detail & Related papers (2024-01-21T10:46:23Z) - MAPO: Advancing Multilingual Reasoning through Multilingual Alignment-as-Preference Optimization [65.31411639849516]
We propose a Multilingual-Alignment-as-Preference Optimization framework (MAPO) to align the reasoning processes in other languages with the dominant language.
Specifically, we harness an off-the-shelf translation model to measure the consistency between answers in non-dominant and dominant languages.
Experiments show that MAPO stably achieves significant improvements in the multilingual reasoning of various models.
arXiv Detail & Related papers (2024-01-12T18:03:54Z) - ULMA: Unified Language Model Alignment with Human Demonstration and Point-wise Preference [16.73260713938154]
A typical alignment procedure consists of supervised fine-tuning and preference learning.
We introduce Point-wise Direct Preference Optimization, a novel preference learning method designed to harness point-wise feedback effectively.
Our work also uncovers a novel connection between supervised fine-tuning and point-wise preference learning, culminating in Unified Language Model Alignment.
arXiv Detail & Related papers (2023-12-05T07:52:12Z)
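Several entries above (the counterspeech work and ULMA in particular) describe the now-typical recipe of supervised fine-tuning followed by preference learning, with ULMA using point-wise rather than pairwise feedback. The following is a minimal, hypothetical sketch of what a point-wise preference loss can look like: a logistic loss on the DPO-style implicit reward with a per-response binary label. It is not those papers' exact formulation; the function name, default β, and label convention are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pointwise_preference_loss(policy_logps: torch.Tensor,
                              ref_logps: torch.Tensor,
                              labels: torch.Tensor,
                              beta: float = 0.1) -> torch.Tensor:
    """Generic point-wise preference loss: each response carries its own
    binary good/bad label instead of being compared with a paired rival.

    policy_logps / ref_logps: per-response sequence log-probabilities.
    labels: 1.0 for preferred responses, 0.0 for dispreferred ones.
    """
    # Implicit per-response reward, as in pairwise DPO.
    reward = beta * (policy_logps - ref_logps)
    # Push the reward up for good responses and down for bad ones.
    return F.binary_cross_entropy_with_logits(reward, labels)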