Related papers: Understanding the Learning Dynamics of Alignment with Human Feedback

Related papers

Unsupervised Synthetic Image Attribution: Alignment and Disentanglement [55.853285140682665]
We propose a simple yet effective unsupervised method called Alignment and Disentanglement.<n>Specifically, we begin by performing basic concept alignment using contrastive self-supervised learning.<n>Next, we enhance the model's attribution ability by promoting representation disentanglement with the Infomax loss.
arXiv Detail & Related papers (2026-01-30T07:31:53Z)
Alignment and Safety in Large Language Models: Safety Mechanisms, Training Paradigms, and Emerging Challenges [47.14342587731284]
This survey provides a comprehensive overview of alignment techniques, training protocols, and empirical findings in large language models (LLMs) alignment.<n>We analyze the development of alignment methods across diverse paradigms, characterizing the fundamental trade-offs between core alignment objectives.<n>We discuss state-of-the-art techniques, including Direct Preference Optimization (DPO), Constitutional AI, brain-inspired methods, and alignment uncertainty quantification (AUQ)
arXiv Detail & Related papers (2025-07-25T20:52:58Z)
Enhancing Training Data Attribution with Representational Optimization [57.61977909113113]
Training data attribution methods aim to measure how training data impacts a model's predictions.<n>We propose AirRep, a representation-based approach that closes this gap by learning task-specific and model-aligned representations explicitly for TDA.<n>AirRep introduces two key innovations: a trainable encoder tuned for attribution quality, and an attention-based pooling mechanism that enables accurate estimation of group-wise influence.
arXiv Detail & Related papers (2025-05-24T05:17:53Z)
Improving Fairness in LLMs Through Testing-Time Adversaries [1.7811840395202343]
Large Language Models (LLMs) push the bound-aries in natural language processing and generative AI.<n>In this work, we propose a straightforward, user-friendly and practical method to mitigate such biases.<n>Our method creates multiple variations of a given sentence by modifying specific attributes and evaluates the corresponding prediction behavior.
arXiv Detail & Related papers (2025-05-17T17:56:53Z)
How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations [69.72654127617058]
Post-hoc importance attribution methods are a popular tool for "explaining" Deep Neural Networks (DNNs) In this work we bring forward empirical evidence that challenges this very notion. We discover a strong dependency on and demonstrate that the training details of a pre-trained model's classification layer play a crucial role.
arXiv Detail & Related papers (2025-03-01T22:25:11Z)
Aligning Visual Contrastive learning models via Preference Optimization [0.9438963196770565]
This paper introduces a novel method for training contrastive learning models using Preference Optimization (PO) to break down complex concepts. Our method systematically aligns model behavior with desired preferences, enhancing performance on the targeted task. In particular, we focus on enhancing model robustness against typographic attacks, commonly seen in contrastive models like CLIP. We further apply our method to disentangle gender understanding and mitigate gender biases, offering a more nuanced control over these sensitive attributes.
arXiv Detail & Related papers (2024-11-12T08:14:54Z)
From Efficiency to Equity: Measuring Fairness in Preference Learning [3.2132738637761027]
We evaluate fairness in preference learning models inspired by economic theories of inequality and Rawlsian justice. We propose metrics adapted from the Gini Coefficient, Atkinson Index, and Kuznets Ratio to quantify fairness in these models.
arXiv Detail & Related papers (2024-10-24T15:25:56Z)
Wait, that's not an option: LLMs Robustness with Incorrect Multiple-Choice Options [2.1184929769291294]
This work introduces a novel framework for evaluating LLMs' capacity to balance instruction-following with critical reasoning.<n>We show that post-training aligned models often default to selecting invalid options, while base models exhibit improved refusal capabilities that scale with model size.<n>We additionally conduct a parallel human study showing similar instruction-following biases, with implications for how these biases may propagate through human feedback datasets used in alignment.
arXiv Detail & Related papers (2024-08-27T19:27:43Z)
Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales. We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [102.16105233826917]
Learning from preference labels plays a crucial role in fine-tuning large language models. There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning.
arXiv Detail & Related papers (2024-04-22T17:20:18Z)
Enhancing Fairness and Performance in Machine Learning Models: A Multi-Task Learning Approach with Monte-Carlo Dropout and Pareto Optimality [1.5498930424110338]
This study introduces an approach to mitigate bias in machine learning by leveraging model uncertainty. Our approach utilizes a multi-task learning (MTL) framework combined with Monte Carlo (MC) Dropout to assess and mitigate uncertainty in predictions related to protected labels.
arXiv Detail & Related papers (2024-04-12T04:17:50Z)
Supervised Fine-Tuning as Inverse Reinforcement Learning [8.044033685073003]
The prevailing approach to aligning Large Language Models (LLMs) typically relies on human or AI feedback. In our work, we question the efficacy of such datasets and explore various scenarios where alignment with expert demonstrations proves more realistic.
arXiv Detail & Related papers (2024-03-18T17:52:57Z)
A Bayesian Approach to Robust Inverse Reinforcement Learning [54.24816623644148]
We consider a Bayesian approach to offline model-based inverse reinforcement learning (IRL) The proposed framework differs from existing offline model-based IRL approaches by performing simultaneous estimation of the expert's reward function and subjective model of environment dynamics. Our analysis reveals a novel insight that the estimated policy exhibits robust performance when the expert is believed to have a highly accurate model of the environment.
arXiv Detail & Related papers (2023-09-15T17:37:09Z)
Understanding Self-Predictive Learning for Reinforcement Learning [61.62067048348786]
We study the learning dynamics of self-predictive learning for reinforcement learning. We propose a novel self-predictive algorithm that learns two representations simultaneously.
arXiv Detail & Related papers (2022-12-06T20:43:37Z)
Scalable PAC-Bayesian Meta-Learning via the PAC-Optimal Hyper-Posterior: From Theory to Practice [54.03076395748459]
A central question in the meta-learning literature is how to regularize to ensure generalization to unseen tasks. We present a generalization bound for meta-learning, which was first derived by Rothfuss et al. We provide a theoretical analysis and empirical case study under which conditions and to what extent these guarantees for meta-learning improve upon PAC-Bayesian per-task learning bounds.
arXiv Detail & Related papers (2022-11-14T08:51:04Z)
Fair Representation Learning using Interpolation Enabled Disentanglement [9.043741281011304]
We propose a novel method to address two key issues: (a) Can we simultaneously learn fair disentangled representations while ensuring the utility of the learned representation for downstream tasks, and (b)Can we provide theoretical insights into when the proposed approach will be both fair and accurate. To address the former, we propose the method FRIED, Fair Representation learning using Interpolation Enabled Disentanglement.
arXiv Detail & Related papers (2021-07-31T17:32:12Z)
Value-driven Hindsight Modelling [68.658900923595]
Value estimation is a critical component of the reinforcement learning (RL) paradigm. Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function. We develop an approach for representation learning in RL that sits in between these two extremes. This provides tractable prediction targets that are directly relevant for a task, and can thus accelerate learning the value function.
arXiv Detail & Related papers (2020-02-19T18:10:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.