Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning
- URL: http://arxiv.org/abs/2510.18849v1
- Date: Tue, 21 Oct 2025 17:40:03 GMT
- Title: Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning
- Authors: Chenghao Zhu, Meiling Tao, Tiannan Wang, Dongyi Ding, Yuchen Eleanor Jiang, Wangchunshu Zhou
- Abstract summary: We propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning.
- Score: 22.252030067675065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Faithfully personalizing large language models (LLMs) to align with individual user preferences is a critical but challenging task. While supervised fine-tuning (SFT) quickly reaches a performance plateau, standard reinforcement learning from human feedback (RLHF) also struggles with the nuances of personalization. Scalar-based reward models are prone to reward hacking, which leads to verbose and superficially personalized responses. To address these limitations, we propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning. Under a rigorous length-controlled evaluation, our method substantially outperforms standard PPO on personalization benchmarks. Personalized Qwen2.5-7B achieves an average 11% win-rate improvement, and the personalized Qwen2.5-14B model surpasses the performance of GPT-4.1. These results demonstrate a practical path to faithful, efficient, and controllable personalization.
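The abstract only sketches the training loop, so the following is a minimal Python sketch of what one Critique-Post-Edit rollout could look like. All names here (grm_evaluate, policy_generate, policy_post_edit), the score dimensions, and the mean aggregation of scores are illustrative assumptions, not the authors' actual interface or implementation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Critique:
    scores: dict   # multi-dimensional scores, e.g. {"preference_fit": 0.5, ...}
    feedback: str  # textual critique that drives the post-edit


def grm_evaluate(profile: str, prompt: str, response: str) -> Critique:
    # Stand-in for the Personalized GRM; in the paper this is a trained model
    # returning per-dimension scores plus a textual critique. The dimensions
    # below are invented for illustration.
    scores = {"preference_fit": 0.5, "faithfulness": 0.8, "conciseness": 0.7}
    return Critique(scores, "Match the user's stated tone; cut filler sentences.")


def policy_generate(profile: str, prompt: str) -> str:
    return f"[draft answer to {prompt!r} for profile {profile!r}]"


def policy_post_edit(profile: str, prompt: str, draft: str, feedback: str) -> str:
    # The policy model revises its own draft conditioned on the critique text.
    return f"[revision of {draft} addressing: {feedback}]"


def critique_post_edit_step(profile: str, prompt: str):
    draft = policy_generate(profile, prompt)
    critique = grm_evaluate(profile, prompt, draft)
    revised = policy_post_edit(profile, prompt, draft, critique.feedback)
    # One plausible aggregation of the multi-dimensional scores into a scalar
    # reward for a PPO-style update; the paper's actual scheme may differ.
    reward = mean(critique.scores.values())
    return draft, revised, reward


if __name__ == "__main__":
    print(critique_post_edit_step("prefers terse, technical answers", "Explain RLHF."))
```

The key idea the sketch tries to capture is that the learning signal is not a single scalar: the GRM's textual critique feeds a self-revision step, which is what the abstract credits for resisting reward hacking and making learning more targeted.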
Related papers
- P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling [66.55381105691818]
We propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism.
arXiv Detail & Related papers (2026-02-12T16:07:22Z)
- One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment [55.86333374784959]
We argue that addressing these constraints requires a paradigm shift from fitting data to learn user preferences toward learning the process of preference adaptation itself. We propose Meta Reward Modeling (MRM), which reformulates personalized reward modeling as a meta-learning problem. We show that MRM enhances few-shot personalization, improves user robustness, and consistently outperforms baselines.
arXiv Detail & Related papers (2026-01-26T17:55:52Z)
- PersonaDual: Balancing Personalization and Objectivity via Adaptive Reasoning [57.10486355722133]
PersonaDual supports both general-purpose objective reasoning and personalized reasoning in a single model. Experiments on objective and personalized benchmarks show that PersonaDual preserves the benefits of personalization while reducing interference.
arXiv Detail & Related papers (2026-01-13T16:02:35Z)
- P-Check: Advancing Personalized Reward Model via Learning to Generate Dynamic Checklist [11.399221632873934]
We propose P-Check, a novel personalized reward modeling framework. P-Check trains a plug-and-play checklist generator that synthesizes dynamic evaluation criteria for guiding reward prediction. Our experiments demonstrate that P-Check not only improves reward accuracy but also enhances downstream personalized generation.
arXiv Detail & Related papers (2026-01-06T12:53:53Z)
- Benchmarking and Improving LLM Robustness for Personalized Generation [42.26075952121524]
We define a model as robust if its responses are both factually accurate and aligned with the user's preferences. Our work highlights critical gaps in current evaluation practices and introduces tools and metrics to support more reliable, user-aligned deployments.
arXiv Detail & Related papers (2025-09-18T13:56:14Z)
- LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model [99.71684530652942]
We show that LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model. Applying self-critique at test time yields an average +13.8% improvement on five representative reasoning tasks. Our results reveal that RL training on critic data can produce a unified model excelling at both evaluation and generation.
arXiv Detail & Related papers (2025-08-31T03:08:02Z)
- Learning from Natural Language Feedback for Personalized Question Answering [21.115495457454365]
Personalization is crucial for enhancing the effectiveness and user satisfaction of language technologies. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG). We introduce Vac, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF).
arXiv Detail & Related papers (2025-08-14T14:36:53Z)
- User-centric Subjective Leaderboard by Customizable Reward Modeling [34.40455169451943]
We present the first User-Centric Subjective Leaderboard (USL). It provides a preference-driven, dynamic ranking of large language models (LLMs) across diverse real-world scenarios. Our work is built upon a thorough investigation of real human preference data, involving more than 10K subjective queries.
arXiv Detail & Related papers (2025-08-13T03:39:04Z)
- RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback [57.967762383794806]
RefCritic is a long chain-of-thought critic module based on reinforcement learning with dual rule-based rewards. We evaluate RefCritic on Qwen2.5-14B-Instruct and DeepSeek-R1-Distill-Qwen-14B across five benchmarks.
arXiv Detail & Related papers (2025-07-20T16:19:51Z)
- Learning to summarize user information for personalized reinforcement learning from human feedback [19.859785715555013]
Preference Learning Using Summarization (PLUS) uses reinforcement learning to learn to produce text-based summaries of each user's preferences. Both the user-summarization model and the reward model are trained simultaneously, creating an online co-adaptation loop. We show that PLUS captures diverse aspects of user preferences, achieving an 11-77% improvement in reward model accuracy.
arXiv Detail & Related papers (2025-07-17T23:48:51Z)
- Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment [35.68913976348608]
We introduce the Reinforcement Learning for Personalized Alignment (RLPA) framework to iteratively infer and refine user profiles through dialogue. We instantiate RLPA by fine-tuning Qwen-2.5-3B-Instruct, resulting in Qwen-RLPA, which achieves state-of-the-art performance in personalized dialogue.
arXiv Detail & Related papers (2025-05-21T12:38:36Z)
- Self-Consistency of the Internal Reward Models Improves Self-Rewarding Language Models [63.116041268654705]
We find that different internal reward models within the same large language model often generate inconsistent preferences. This inconsistency raises concerns about the reliability of self-generated preference data, hinders overall alignment performance, and highlights the need for further research. We propose Self-Consistent Internal Rewards (SCIR), a novel framework designed to enhance consistency among internal reward models during training.
arXiv Detail & Related papers (2025-02-13T03:15:31Z)
- Self-Generated Critiques Boost Reward Modeling for Language Models [57.60881438647227]
Critic-RM is a framework that improves reward models using self-generated critiques without extra supervision. Experiments show that Critic-RM improves reward modeling accuracy by 3.7%-7.3% compared to standard reward models and LLM judges.
arXiv Detail & Related papers (2024-11-25T18:28:26Z)