Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL
- URL: http://arxiv.org/abs/2510.06092v1
- Date: Tue, 07 Oct 2025 16:20:14 GMT
- Title: Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL
- Authors: Nyal Patel, Matthieu Bou, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo,
- Abstract summary: Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) with human preferences.<n>Existing approaches attempt to extract these latent incentives using Inverse Reinforcement Learning (IRL)<n>We introduce a novel emphfailure-aware IRL algorithm that focuses on misclassified or difficult examples to recover the latent rewards defining model behaviors.
- Score: 8.030821324147515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) with human preferences, yet the underlying reward signals they internalize remain hidden, posing a critical challenge for interpretability and safety. Existing approaches attempt to extract these latent incentives using Inverse Reinforcement Learning (IRL), but treat all preference pairs equally, often overlooking the most informative signals: those examples the extracted reward model misclassifies or assigns nearly equal scores, which we term \emph{failures}. We introduce a novel \emph{failure-aware} IRL algorithm that focuses on misclassified or difficult examples to recover the latent rewards defining model behaviors. By learning from these failures, our failure-aware IRL extracts reward functions that better reflect the true objectives behind RLHF. We demonstrate that failure-aware IRL outperforms existing IRL baselines across multiple metrics when applied to LLM detoxification, without requiring external classifiers or supervision. Crucially, failure-aware IRL yields rewards that better capture the true incentives learned during RLHF, enabling more effective re-RLHF training than standard IRL. This establishes failure-aware IRL as a robust, scalable method for auditing model alignment and reducing ambiguity in the IRL process.
Related papers
- Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models [59.6715047267181]
Small reasoning models (SRMs) are prone to hallucinations, especially in intermediate reasoning steps.<n>Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained chain-of-thought evaluation.<n>We propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model.
arXiv Detail & Related papers (2026-02-05T17:15:12Z) - From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation [52.62655622099456]
We propose reinforcement learning with verifiable reference-based rewards (RLVRR)<n>Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., reward chain)<n>In this way, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts, and style, which evaluates adherence to stylistic properties.
arXiv Detail & Related papers (2026-01-26T14:39:58Z) - Replay Failures as Successes: Sample-Efficient Reinforcement Learning for Instruction Following [42.05102776289243]
Reinforcement Learning (RL) has shown promise for aligning Large Language Models (LLMs) to follow instructions with various constraints.<n>We propose Hindsight instruction Replay (HiR), a novel sample-efficient RL framework for complex instruction following tasks.
arXiv Detail & Related papers (2025-12-29T13:31:08Z) - ConfClip: Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLMs [32.13266235550995]
Reinforcement learning (RL) has become a standard paradigm for refining large language models (LLMs)<n>Inspired by observations from human learning, we introduce a RL technique that integrates verifiable outcomes with the model's own confidence estimates.
arXiv Detail & Related papers (2025-09-22T13:00:35Z) - One Token to Fool LLM-as-a-Judge [52.45386385722788]
Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models.<n>We uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking.
arXiv Detail & Related papers (2025-07-11T17:55:22Z) - Learning to Reason without External Rewards [100.27210579418562]
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision.<n>We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data.<n>We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal.
arXiv Detail & Related papers (2025-05-26T07:01:06Z) - GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [62.536191233049614]
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs)<n>This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld.<n>We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we termed thought collapse.
arXiv Detail & Related papers (2025-03-11T15:17:02Z) - Linear Probe Penalties Reduce LLM Sycophancy [3.6490659260835234]
Large language models (LLMs) are often sycophantic, prioritizing agreement with their users over accurate or objective statements.<n>This problematic behavior becomes more pronounced during reinforcement learning from human feedback (RLHF)<n>We develop a linear probing method to identify and penalize markers of sycophancy within the reward model, producing rewards that discourage sycophantic behavior.
arXiv Detail & Related papers (2024-12-01T21:11:28Z) - The Alignment Ceiling: Objective Mismatch in Reinforcement Learning from
Human Feedback [5.037876196534672]
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings.
In this paper, we illustrate the causes of this issue, reviewing relevant literature from model-based reinforcement learning, and argue for solutions.
arXiv Detail & Related papers (2023-10-31T21:52:41Z) - Imitating, Fast and Slow: Robust learning from demonstrations via
decision-time planning [96.72185761508668]
Planning at Test-time (IMPLANT) is a new meta-algorithm for imitation learning.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.