P-Check: Advancing Personalized Reward Model via Learning to Generate Dynamic Checklist
- URL: http://arxiv.org/abs/2601.02986v1
- Date: Tue, 06 Jan 2026 12:53:53 GMT
- Title: P-Check: Advancing Personalized Reward Model via Learning to Generate Dynamic Checklist
- Authors: Kwangwook Seo, Dongha Lee
- Abstract summary: We propose P-Check, a novel personalized reward modeling framework. P-Check trains a plug-and-play checklist generator that synthesizes dynamic evaluation criteria for guiding reward prediction. We conduct experiments and demonstrate that P-Check not only improves reward accuracy but also enhances downstream personalized generation.
- Score: 11.399221632873934
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent approaches in personalized reward modeling have primarily focused on leveraging user interaction history to align model judgments with individual preferences. However, existing approaches largely treat user context as a static or implicit conditioning signal, failing to capture the dynamic and multi-faceted nature of human judgment. In this paper, we propose P-Check, a novel personalized reward modeling framework that trains a plug-and-play checklist generator to synthesize dynamic evaluation criteria for guiding reward prediction. To better align these checklists with personalized nuances, we introduce Preference-Contrastive Criterion Weighting, a training strategy that assigns saliency scores to criteria based on their discriminative power for personalized judgment. We conduct extensive experiments and demonstrate that P-Check not only improves reward accuracy but also enhances downstream personalized generation, and remains robust in out-of-distribution (OOD) scenarios.
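The abstract specifies the interface but not the implementation; below is a minimal Python sketch of checklist-guided reward scoring under our own assumptions. `generate_checklist`, the `judge` callable, and the contrastive weighting rule are illustrative stand-ins, not the paper's method.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Criterion:
    text: str            # natural-language evaluation criterion
    weight: float = 1.0  # saliency weight, set by the contrastive rule below

def generate_checklist(user_context: str) -> List[Criterion]:
    # Hypothetical stand-in for the trained plug-and-play generator,
    # which in the paper synthesizes criteria dynamically per user/query.
    return [Criterion("Matches the user's preferred tone and formality"),
            Criterion("Reflects topics the user engaged with previously"),
            Criterion("Respects constraints stated in the user's history")]

def checklist_reward(criteria: List[Criterion],
                     judge: Callable[[str, str], float],
                     response: str) -> float:
    """Reward = saliency-weighted mean of per-criterion judge scores in [0, 1]."""
    total = sum(c.weight for c in criteria)
    return sum(c.weight * judge(c.text, response) for c in criteria) / total

def contrastive_weighting(criteria: List[Criterion],
                          judge: Callable[[str, str], float],
                          pairs: List[Tuple[str, str]]) -> List[Criterion]:
    """One reading of Preference-Contrastive Criterion Weighting: upweight
    criteria whose scores best separate chosen from rejected responses."""
    for c in criteria:
        gaps = [judge(c.text, chosen) - judge(c.text, rejected)
                for chosen, rejected in pairs]
        c.weight = max(1e-6, sum(gaps) / len(gaps))  # discriminative power
    return criteria
```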
Related papers
- Synthetic Interaction Data for Scalable Personalization in Large Language Models [67.31884245564086]
We introduce a high-fidelity synthetic data generation framework called PersonaGym. Unlike prior work that treats personalization as static persona-preference pairs, PersonaGym models a dynamic preference process. We release PersonaAtlas, a large-scale, high-quality, and diverse synthetic dataset of high-fidelity multi-turn personalized interaction trajectories.
arXiv Detail & Related papers (2026-02-12T20:41:22Z)
- P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling [66.55381105691818]
We propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism, sketched below.
arXiv Detail & Related papers (2026-02-12T16:07:22Z)
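The summary above names the moving parts (personas, prototypes, dual-granularity scaling) without implementation detail; here is a hedged sketch of the prototype-plus-user scaling idea. The embeddings, cluster count, and `judge_once` callable are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative setup: each user is an embedding of their preference history;
# "User Prototypes" are taken here to be k-means centers over those embeddings.
rng = np.random.default_rng(0)
user_embs = rng.normal(size=(500, 64))                 # 500 users, 64-d
prototypes = KMeans(n_clusters=8, n_init=10, random_state=0).fit(user_embs)

def dual_granularity_score(judge_once, user_id: int,
                           n_user: int = 4, n_proto: int = 4) -> float:
    """Average several stochastic judgments conditioned on the individual
    user with several conditioned on the user's prototype: one plausible
    reading of the dual-granularity test-time scaling described above."""
    pid = int(prototypes.predict(user_embs[user_id:user_id + 1])[0])
    user_votes = [judge_once(("user", user_id)) for _ in range(n_user)]
    proto_votes = [judge_once(("prototype", pid)) for _ in range(n_proto)]
    return 0.5 * float(np.mean(user_votes)) + 0.5 * float(np.mean(proto_votes))
```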
- Unified Personalized Reward Model for Vision Generation [27.496220369122494]
We propose UnifiedReward-Flex, a unified personalized reward model for vision generation. We first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap supervised fine-tuning (SFT). We then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment; the DPO objective is sketched below.
arXiv Detail & Related papers (2026-02-02T17:44:21Z)
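DPO itself is standard; for reference, the objective the second stage optimizes looks like this (a generic implementation, not UnifiedReward-Flex's code). Inputs are summed log-probabilities of each response under the policy and under the frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization: push the policy's chosen-vs-rejected
    log-ratio above the reference model's, with strength beta."""
    pi_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()
```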
- One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment [55.86333374784959]
We argue that addressing these constraints requires a paradigm shift: from fitting data to learn user preferences to learning the process of preference adaptation itself. We propose Meta Reward Modeling (MRM), which reformulates personalized reward modeling as a meta-learning problem (see the sketch below). We show that MRM enhances few-shot personalization, improves robustness across users, and consistently outperforms baselines.
arXiv Detail & Related papers (2026-01-26T17:55:52Z)
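The summary does not say which meta-learning algorithm MRM uses; as a concrete stand-in, a Reptile-style update over per-user episodes captures the "learn to adapt" framing. All names and hyperparameters here are our assumptions.

```python
import copy
import torch
import torch.nn.functional as F

def reptile_step(reward_model, user_batches, inner_lr=1e-2,
                 outer_lr=1e-3, inner_steps=5):
    """Reptile-style meta-update: adapt a copy of the reward model to each
    user's few preference pairs, then move the shared initialization toward
    the average of the adapted weights."""
    snapshot = copy.deepcopy(reward_model)  # frozen meta-initialization
    deltas = {n: torch.zeros_like(p) for n, p in reward_model.named_parameters()}
    for chosen, rejected in user_batches:   # one user's few-shot feature tensors
        clone = copy.deepcopy(snapshot)
        opt = torch.optim.SGD(clone.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            # Bradley-Terry preference loss on this user's pairs
            loss = -F.logsigmoid(clone(chosen) - clone(rejected)).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        for (n, p0), q in zip(snapshot.named_parameters(), clone.parameters()):
            deltas[n] += (q.data - p0.data) / len(user_batches)
    for n, p in reward_model.named_parameters():
        p.data += outer_lr * deltas[n]      # meta-step toward adapted solutions
```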
- Probing Preference Representations: A Multi-Dimensional Evaluation and Analysis Method for Reward Models [63.00458229517523]
This work addresses the evaluation challenge of reward models by probing preference representations. We construct a Multi-dimensional Reward Model Benchmark (MRMBench), a collection of six probing tasks covering different preference dimensions. We introduce an analysis method, inference-time probing, which identifies the dimensions used during reward prediction, enhancing interpretability; a minimal probe is sketched below.
arXiv Detail & Related papers (2025-11-16T05:29:29Z)
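The probing tasks themselves are not given in this summary, but the mechanics of a linear probe, the usual tool behind such benchmarks, are simple to show. Data shapes and the 80/20 split are illustrative choices of ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_accuracy(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    """Held-out accuracy of a linear probe for one preference dimension.
    High accuracy suggests the dimension is linearly decodable from the
    reward model's representation, the kind of evidence inference-time
    probing is after."""
    split = int(0.8 * len(labels))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(hidden_states[:split], labels[:split])
    return clf.score(hidden_states[split:], labels[split:])
```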
- PreferThinker: Reasoning-based Personalized Image Preference Assessment [83.66114370585976]
We propose a reasoning-based personalized image preference assessment framework. It first predicts a user's preference profile from reference images, then provides interpretable, multi-dimensional scores and assessments of candidate images.
arXiv Detail & Related papers (2025-11-01T16:19:51Z)
- Listwise Preference Diffusion Optimization for User Behavior Trajectories Prediction [41.53271688465831]
We formulate User Behavior Trajectory Prediction (UBTP) as a new task setting that explicitly models long-term user preferences. We introduce Listwise Preference Diffusion Optimization (LPDO), a diffusion-based training framework that directly optimizes structured preferences over entire item sequences. To rigorously evaluate multi-step prediction quality, we propose the task-specific metric Sequential Match (SeqMatch), which measures exact trajectory agreement, and adopt Perplexity (PPL), which assesses probabilistic fidelity; both metrics are sketched below.
arXiv Detail & Related papers (2025-11-01T12:16:24Z)
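Both metrics are simple to state. A direct implementation, assuming whole-trajectory exact match (the summary's "exact trajectory agreement") and per-item log-probabilities for PPL:

```python
import math
from typing import List, Sequence

def seq_match(preds: List[Sequence], golds: List[Sequence]) -> float:
    """Fraction of users whose predicted item trajectory exactly matches
    the ground-truth trajectory."""
    return sum(tuple(p) == tuple(g) for p, g in zip(preds, golds)) / len(golds)

def perplexity(logprobs: List[float]) -> float:
    """Perplexity from per-item log-probabilities of the gold trajectory."""
    return math.exp(-sum(logprobs) / len(logprobs))
```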
- Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning [22.252030067675065]
We propose Critique-Post-Edit, a robust reinforcement learning framework that enables more faithful and controllable personalization. Our framework integrates two key components: (1) a Personalized Generative Reward Model (GRM) that provides multi-dimensional scores and textual critiques to resist reward hacking, and (2) a Critique-Post-Edit mechanism where the policy model revises its own outputs based on these critiques for more targeted and efficient learning (see the sketch below).
arXiv Detail & Related papers (2025-10-21T17:40:03Z)
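The control flow described in (2) is easy to make concrete; all objects below (`policy`, `grm`, and their methods) are hypothetical interfaces, not the paper's API.

```python
def critique_post_edit_rollout(policy, grm, prompt: str, user_profile: str):
    """One rollout of the critique-then-revise loop the summary describes:
    the GRM returns multi-dimensional scores plus a textual critique, and
    the policy post-edits its own draft conditioned on that critique."""
    draft = policy.generate(prompt, user_profile)
    scores, critique = grm.evaluate(prompt, draft, user_profile)
    revision = policy.generate(prompt, user_profile, critique=critique)
    # Both draft and revision, with their rewards, can feed the RL update,
    # giving a more targeted learning signal than scalar scores alone.
    return draft, revision, scores
```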
- Interpretable Reward Modeling with Active Concept Bottlenecks [54.00085739303773]
We introduce Concept Bottleneck Reward Models (CB-RM), a reward modeling framework that enables interpretable preference learning. Unlike standard RLHF methods that rely on opaque reward functions, CB-RM decomposes reward prediction into human-interpretable concepts. We formalize an active learning strategy that dynamically acquires the most informative concept labels (see the sketch below).
arXiv Detail & Related papers (2025-07-07T06:26:04Z)
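A concept bottleneck is a well-defined architectural pattern, so a sketch is possible even though CB-RM's exact acquisition rule isn't given here; the entropy-based rule below is a common default, labeled as our assumption.

```python
import torch
import torch.nn as nn

class ConceptBottleneckRM(nn.Module):
    """Reward head in the concept-bottleneck style: the reward is a linear
    function of K interpretable concept scores, so every prediction
    decomposes into inspectable per-concept contributions."""
    def __init__(self, d_model: int, n_concepts: int):
        super().__init__()
        self.concepts = nn.Linear(d_model, n_concepts)  # concept predictor
        self.head = nn.Linear(n_concepts, 1)            # reward from concepts only

    def forward(self, h: torch.Tensor):
        c = torch.sigmoid(self.concepts(h))             # concept scores in (0, 1)
        return self.head(c).squeeze(-1), c

def most_uncertain_concepts(c: torch.Tensor) -> torch.Tensor:
    """Toy active-learning rule (our assumption, not necessarily CB-RM's):
    query labels for the concept with the highest Bernoulli entropy."""
    p = c.clamp(1e-6, 1 - 1e-6)
    ent = -(p * p.log() + (1 - p) * (1 - p).log())
    return ent.argmax(dim=-1)
```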
- Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment [35.68913976348608]
We introduce the Reinforcement Learning for Personalized Alignment (RLPA) framework to iteratively infer and refine user profiles through dialogue. We instantiate RLPA by fine-tuning Qwen-2.5-3B-Instruct, resulting in Qwen-RLPA, which achieves state-of-the-art performance in personalized dialogue.
arXiv Detail & Related papers (2025-05-21T12:38:36Z)
- Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment [21.677859755364334]
Persona-judge is a novel discriminative paradigm that enables training-free personalized alignment with unseen preferences. We show that Persona-judge offers a scalable and computationally efficient solution to personalized alignment.
arXiv Detail & Related papers (2025-04-17T05:50:13Z)
- PURS: Personalized Unexpected Recommender System for Improving User Satisfaction [76.98616102965023]
We describe a novel Personalized Unexpected Recommender System (PURS) model that incorporates unexpectedness into the recommendation process.
Extensive offline experiments on three real-world datasets illustrate that the proposed PURS model significantly outperforms the state-of-the-art baseline approaches.
arXiv Detail & Related papers (2021-06-05T01:33:21Z)
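The PURS summary gives only the high-level idea; a common way to operationalize "unexpectedness", shown here purely for illustration (the paper's own formulation may differ), is distance from the items a user has already consumed, blended with base relevance.

```python
import numpy as np

def unexpectedness(item_emb: np.ndarray, history_embs: np.ndarray) -> float:
    """Distance from the candidate item to the nearest item the user has
    already consumed; larger means more surprising."""
    return float(np.linalg.norm(history_embs - item_emb, axis=1).min())

def hybrid_score(relevance: float, unexp: float, alpha: float = 0.3) -> float:
    """Blend base relevance with an unexpectedness bonus; alpha is an
    illustrative knob, not a value from the paper."""
    return (1 - alpha) * relevance + alpha * unexp
```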