Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning
- URL: http://arxiv.org/abs/2510.18849v1
- Date: Tue, 21 Oct 2025 17:40:03 GMT
- Title: Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning
- Authors: Chenghao Zhu, Meiling Tao, Tiannan Wang, Dongyi Ding, Yuchen Eleanor Jiang, Wangchunshu Zhou
- Abstract summary: We propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning.
- Score: 22.252030067675065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking, which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. Personalized Qwen2.5-7B achieves an average 11% win-rate improvement, and the personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
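The abstract only sketches the training loop, so the following is a minimal Python sketch of what one Critique-Post-Edit rollout could look like. All names here (grm_evaluate, policy_generate, policy_post_edit), the score dimensions, and the mean aggregation of scores are illustrative assumptions, not the authors' actual interface or implementation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Critique:
    scores: dict   # multi-dimensional scores, e.g. {"preference_fit": 0.5, ...}
    feedback: str  # textual critique that drives the post-edit


def grm_evaluate(profile: str, prompt: str, response: str) -> Critique:
    # Stand-in for the Personalized GRM; in the paper this is a trained model
    # returning per-dimension scores plus a textual critique. The dimensions
    # below are invented for illustration.
    scores = {"preference_fit": 0.5, "faithfulness": 0.8, "conciseness": 0.7}
    return Critique(scores, "Match the user's stated tone; cut filler sentences.")


def policy_generate(profile: str, prompt: str) -> str:
    return f"[draft answer to {prompt!r} for profile {profile!r}]"


def policy_post_edit(profile: str, prompt: str, draft: str, feedback: str) -> str:
    # The policy model revises its own draft conditioned on the critique text.
    return f"[revision of {draft} addressing: {feedback}]"


def critique_post_edit_step(profile: str, prompt: str):
    draft = policy_generate(profile, prompt)
    critique = grm_evaluate(profile, prompt, draft)
    revised = policy_post_edit(profile, prompt, draft, critique.feedback)
    # One plausible aggregation of the multi-dimensional scores into a scalar
    # reward for a PPO-style update; the paper's actual scheme may differ.
    reward = mean(critique.scores.values())
    return draft, revised, reward


if __name__ == "__main__":
    print(critique_post_edit_step("prefers terse, technical answers", "Explain RLHF."))
```

The key idea the sketch tries to capture is that the learning signal is not a single scalar: the GRM's textual critique feeds a self-revision step, which is what the abstract credits for resisting reward hacking and making learning more targeted.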
Related papers
- P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling [66.55381105691818]
We propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism.
arXiv Detail & Related papers (2026-02-12T16:07:22Z)
- One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment [55.86333374784959]
We argue that addressing these constraints requires a paradigm shift from fitting data to learn user preferences toward learning the process of preference adaptation itself. We propose Meta Reward Modeling (MRM), which reformulates personalized reward modeling as a meta-learning problem. We show that MRM enhances few-shot personalization, improves user robustness, and consistently outperforms baselines.
arXiv Detail & Related papers (2026-01-26T17:55:52Z)
- PersonaDual: Balancing Personalization and Objectivity via Adaptive Reasoning [57.10486355722133]
PersonaDual supports both general-purpose objective reasoning and personalized reasoning in a single model. Experiments on objective and personalized benchmarks show that PersonaDual preserves the benefits of personalization while reducing interference.
arXiv Detail & Related papers (2026-01-13T16:02:35Z)
- P-Check: Advancing Personalized Reward Model via Learning to Generate Dynamic Checklist [11.399221632873934]
We propose P-Check, a novel personalized reward modeling framework. P-Check trains a plug-and-play checklist generator that synthesizes dynamic evaluation criteria for guiding reward prediction. Our experiments demonstrate that P-Check not only improves reward accuracy but also enhances downstream personalized generation.
arXiv Detail & Related papers (2026-01-06T12:53:53Z)
- Benchmarking and Improving LLM Robustness for Personalized Generation [42.26075952121524]
We define a model as robust if its responses are both factually accurate and aligned with the user's preferences. Our work highlights critical gaps in current evaluation practices and introduces tools and metrics to support more reliable, user-aligned deployments.
arXiv Detail & Related papers (2025-09-18T13:56:14Z)
- LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model [99.71684530652942]
We show that LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model. Applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation.
arXiv Detail & Related papers (2025-08-31T03:08:02Z)
- Learning from Natural Language Feedback for Personalized Question Answering [21.115495457454365]
Personalization is crucial for enhancing the effectiveness and user satisfaction of language technologies. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG). We introduce Vac, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF).
arXiv Detail & Related papers (2025-08-14T14:36:53Z)
- User-centric Subjective Leaderboard by Customizable Reward Modeling [34.40455169451943]
We present the first User-Centric Subjective Leaderboard (USL). It provides a preference-driven, dynamic ranking of large language models (LLMs) across diverse real-world scenarios. Our work is built upon a thorough investigation of real human preference data, involving more than 10K subjective queries.
arXiv Detail & Related papers (2025-08-13T03:39:04Z)
- RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback [57.967762383794806]
RefCritic is a long chain-of-thought critic module based on reinforcement learning with dual rule-based rewards. We evaluate RefCritic on Qwen2.5-14B-Instruct and DeepSeek-R1-Distill-Qwen-14B across five benchmarks.
arXiv Detail & Related papers (2025-07-20T16:19:51Z)
- Learning to summarize user information for personalized reinforcement learning from human feedback [19.859785715555013]
Preference Learning Using Summarization (PLUS) uses reinforcement learning to learn to produce text-based summaries of each user's preferences. Both the user-summarization model and the reward model are trained simultaneously, creating an online co-adaptation loop. We show that PLUS captures diverse aspects of user preferences, achieving an 11-77% improvement in reward model accuracy.
arXiv Detail & Related papers (2025-07-17T23:48:51Z)
- Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment [35.68913976348608]
We introduce the Reinforcement Learning for Personalized Alignment (RLPA) framework to iteratively infer and refine user profiles through dialogue. We instantiate RLPA by fine-tuning Qwen-2.5-3B-Instruct, resulting in Qwen-RLPA, which achieves state-of-the-art performance in personalized dialogue.
arXiv Detail & Related papers (2025-05-21T12:38:36Z)
- Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models [63.116041268654705]
We find that different internal reward models within the same large language model often generate inconsistent preferences. This inconsistency raises concerns about the reliability of self-generated preference data, hinders overall alignment performance, and highlights the need for further research. We propose Self-Consistent Internal Rewards (SCIR), a novel framework designed to enhance consistency among internal reward models during training.
arXiv Detail & Related papers (2025-02-13T03:15:31Z)
- Self-Generated Critiques Boost Reward Modeling for Language Models [57.60881438647227]
Critic-RM is a framework that improves reward models using self-generated critiques without extra supervision. Experiments show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges.
arXiv Detail & Related papers (2024-11-25T18:28:26Z)