Related papers: Unintended Impacts of LLM Alignment on Global Representation

URL: http://arxiv.org/abs/2402.15018v2
Date: Thu, 6 Jun 2024 22:31:48 GMT
Title: Unintended Impacts of LLM Alignment on Global Representation
Authors: Michael J. Ryan, William Held, Diyi Yang
Abstract summary: Developers align Large Language Models (LLMs) to user preferences through a variety of procedures, such as Reinforcement Learning From Human Feedback (RLHF) and Direct Preference Optimization (DPO). We explore how alignment impacts performance along three axes of global representation: English dialects, multilingualism, and opinions from and about countries worldwide. We conclude by discussing design decisions that led to these unintended impacts and recommendations for more equitable preference tuning.
Score: 62.6579934112071
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Before being deployed for user-facing applications, developers align Large Language Models (LLMs) to user preferences through a variety of procedures, such as Reinforcement Learning From Human Feedback (RLHF) and Direct Preference Optimization (DPO). Current evaluations of these procedures focus on benchmarks of instruction following, reasoning, and truthfulness. However, human preferences are not universal, and aligning to specific preference sets may have unintended effects. We explore how alignment impacts performance along three axes of global representation: English dialects, multilingualism, and opinions from and about countries worldwide. Our results show that current alignment procedures create disparities between English dialects and global opinions. We find alignment improves capabilities in several languages. We conclude by discussing design decisions that led to these unintended impacts and recommendations for more equitable preference tuning. We make our code and data publicly available on GitHub.

Related papers

Northeastern Uni at Multilingual Counterspeech Generation: Enhancing Counter Speech Generation with LLM Alignment through Direct Preference Optimization [1.1368382184602488]
The automatic generation of counter-speech (CS) is a critical strategy for addressing hate speech by providing constructive and informed responses. Existing methods often fail to generate high-quality, impactful, and scalable CS, particularly across diverse linguistic contexts. We propose a novel methodology to enhance CS generation by aligning Large Language Models (LLMs) using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).
arXiv Detail & Related papers (2024-12-19T23:22:11Z)
Active Preference-based Learning for Multi-dimensional Personalization [7.349038301460469]
Large language models (LLMs) have shown remarkable versatility across tasks, but aligning them with individual human preferences remains challenging. We propose an active preference learning framework that uses binary feedback to estimate user preferences across multiple objectives. We validate our approach through theoretical analysis and experiments on language generation tasks, demonstrating its feedback efficiency and effectiveness in personalizing model responses.
arXiv Detail & Related papers (2024-11-01T11:49:33Z)
MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time [50.41806216615488]
Large Language Models (LLMs) acquire extensive knowledge and remarkable abilities from extensive text corpora. To make LLMs more usable, aligning them with human preferences is essential. We propose an effective method, MetaAlign, which aims to help LLMs dynamically align with various explicit or implicit preferences specified at inference time.
arXiv Detail & Related papers (2024-10-18T05:31:13Z)
Assessing Code Generation with Intermediate Languages [6.999311675957218]
This study explores the utilization of intermediate languages, including various programming languages, natural language solutions, and pseudo-code. Our findings reveal that intermediate languages generally exhibit greater efficacy in larger models that have not yet achieved state-of-the-art performance.
arXiv Detail & Related papers (2024-07-07T15:35:41Z)
Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment [103.12563033438715]
Alignment in artificial intelligence pursues consistency between model responses and human preferences as well as values. Existing alignment techniques are mostly unidirectional, leading to suboptimal trade-offs and poor flexibility over various objectives. We introduce controllable preference optimization (CPO), which explicitly specifies preference scores for different objectives.
arXiv Detail & Related papers (2024-02-29T12:12:30Z)
MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences. Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z)
Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback [70.32795295142648]
Linear alignment is a novel algorithm that aligns language models with human preferences in one single inference step. Experiments on both general and personalized preference datasets demonstrate that linear alignment significantly enhances the performance and efficiency of LLM alignment.
arXiv Detail & Related papers (2024-01-21T10:46:23Z)
MAPO: Advancing Multilingual Reasoning through Multilingual Alignment-as-Preference Optimization [65.31411639849516]
We propose a Multilingual-Alignment-as-Preference Optimization framework (MAPO) to align the reasoning processes in other languages with the dominant language. Specifically, we harness an off-the-shelf translation model for the consistency between answers in non-dominant and dominant languages. Experiments show that MAPO stably achieves significant improvements in the multilingual reasoning of various models.
arXiv Detail & Related papers (2024-01-12T18:03:54Z)
ULMA: Unified Language Model Alignment with Human Demonstration and Point-wise Preference [16.73260713938154]
A typical alignment procedure consists of supervised fine-tuning and preference learning. We introduce Point-wise Direct Preference Optimization, a novel preference learning method designed to harness point-wise feedback effectively. Our work also uncovers a novel connection between supervised fine-tuning and point-wise preference learning, culminating in Unified Language Model Alignment.
arXiv Detail & Related papers (2023-12-05T07:52:12Z)
Sample Efficient Preference Alignment in LLMs via Active Exploration [63.84454768573154]
We take advantage of the fact that one can often choose contexts at which to obtain human feedback to most efficiently identify a good policy. We propose an active exploration algorithm to efficiently select the data and provide theoretical proof that it has a worst-case regret bound. Our method outperforms the baselines with limited samples of human preferences on several language models and four real-world datasets.
arXiv Detail & Related papers (2023-12-01T00:54:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.