ChARM: Character-based Act-adaptive Reward Modeling for Advanced Role-Playing Language Agents
- URL: http://arxiv.org/abs/2505.23923v1
- Date: Thu, 29 May 2025 18:15:18 GMT
- Title: ChARM: Character-based Act-adaptive Reward Modeling for Advanced Role-Playing Language Agents
- Authors: Feiteng Fang, Ting-En Lin, Yuchuan Wu, Xiong Liu, Xiang Huang, Dingwei Chen, Jing Ye, Haonan Zhang, Liang Zhu, Hamid Alinejad-Rokny, Min Yang, Fei Huang, Yongbin Li
- Abstract summary: Role-Playing Language Agents (RPLAs) aim to simulate characters for realistic and engaging human-computer interactions. We propose ChARM, a Character-based Act-adaptive Reward Model. We introduce RoleplayPref, the first large-scale preference dataset specifically for RPLAs.
- Score: 60.325553329946
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Role-Playing Language Agents (RPLAs) aim to simulate characters for realistic and engaging human-computer interactions. However, traditional reward models often struggle with scalability and adapting to subjective conversational preferences. We propose ChARM, a Character-based Act-adaptive Reward Model, addressing these challenges through two innovations: (1) an act-adaptive margin that significantly enhances learning efficiency and generalizability, and (2) a self-evolution mechanism leveraging large-scale unlabeled data to improve training coverage. Additionally, we introduce RoleplayPref, the first large-scale preference dataset specifically for RPLAs, featuring 1,108 characters, 13 subcategories, and 16,888 bilingual dialogues, alongside RoleplayEval, a dedicated evaluation benchmark. Experimental results show a 13% improvement over the conventional Bradley-Terry model in preference rankings. Furthermore, applying ChARM-generated rewards to preference learning techniques (e.g., direct preference optimization) achieves state-of-the-art results on CharacterEval and RoleplayEval. Code and dataset are available at https://github.com/calubkk/ChARM.
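To make the first innovation concrete, the act-adaptive margin can be read as a per-pair margin term added to a standard Bradley-Terry pairwise loss, where the margin grows with how differently the two responses act out the character. The sketch below is a minimal illustration under that reading; the `act_gap` signal, function name, and scaling are hypothetical stand-ins, not the released ChARM implementation.

```python
import torch
import torch.nn.functional as F

def bt_loss_with_adaptive_margin(r_chosen, r_rejected, act_gap, margin_scale=1.0):
    """Bradley-Terry pairwise loss with a per-pair adaptive margin.

    r_chosen, r_rejected: reward-model scores for the preferred / dispreferred
        role-play responses (shape [batch]).
    act_gap: a per-pair value in [0, 1] measuring how far apart the two
        responses are in character-acting quality (hypothetical proxy for the
        act-adaptive signal); larger gaps demand a larger reward margin.
    """
    margin = margin_scale * act_gap
    # -log sigmoid(r_c - r_r - margin): the chosen response must beat the
    # rejected one by at least the act-dependent margin.
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()

# Toy usage: scores from a reward-model head over a batch of preference pairs.
r_c = torch.tensor([1.2, 0.3, 0.9])
r_r = torch.tensor([0.4, 0.1, 1.0])
gap = torch.tensor([0.8, 0.2, 0.5])
print(bt_loss_with_adaptive_margin(r_c, r_r, gap).item())
```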
Related papers
- Character-R1: Enhancing Role-Aware Reasoning in Role-Playing Agents via RLVR [67.66592867046229]
Character-R1 is a framework designed to provide verifiable reward signals for effective role-aware reasoning. Our framework comprises three core designs: Cognitive Focus Reward, Reference-Guided Reward and Character-Conditioned Reward Normalization.
arXiv Detail & Related papers (2026-01-08T05:33:37Z) - RoleRMBench & RoleRM: Towards Reward Modeling for Profile-Based Role Play in Dialogue Systems [85.16327248973387]
We develop RoleRM, a reward model trained with Continuous Implicit Preferences (CIP). We show RoleRM surpasses strong open- and closed-source reward models by over 24% on average. Our findings highlight the importance of continuous preference representation and annotation consistency, establishing a foundation for subjective alignment in human-centered dialogue systems.
arXiv Detail & Related papers (2025-12-11T12:04:46Z) - Aligning Large Language Models via Fully Self-Synthetic Data [20.05693955243206]
Traditional reinforcement learning from human feedback (RLHF) for large language models (LLMs) relies on expensive human-annotated datasets. In this work, we introduce Self-Alignment Optimization (SAO), a fully self-synthetic framework for LLM alignment. Experiments demonstrate that SAO effectively enhances the model's chat capabilities on standard benchmarks like AlpacaEval2.0.
arXiv Detail & Related papers (2025-10-08T05:07:45Z) - ELIXIR: Efficient and LIghtweight model for eXplaIning Recommendations [1.9711529297777448]
We propose ELIXIR, a multi-task model combining rating prediction with personalized review generation. ELIXIR jointly learns global and aspect-specific representations of users and items, optimizing overall rating, aspect-level ratings, and review generation. Based on a T5-small (60M) model, we demonstrate the effectiveness of our aspect-based architecture in guiding text generation in a personalized context.
arXiv Detail & Related papers (2025-08-27T23:01:11Z) - What Makes LLMs Effective Sequential Recommenders? A Study on Preference Intensity and Temporal Context [56.590259941275434]
RecPO is a preference optimization framework for sequential recommendation. It exploits adaptive reward margins based on inferred preference hierarchies and temporal signals. It mirrors key characteristics of human decision-making: favoring timely satisfaction, maintaining coherent preferences, and exercising discernment under shifting contexts.
arXiv Detail & Related papers (2025-06-02T21:09:29Z) - FLoRA: Sample-Efficient Preference-based RL via Low-Rank Style Adaptation of Reward Functions [14.26977110112456]
Preference-based reinforcement learning is a suitable approach for style adaptation of pre-trained robotic behavior. Recent adaptation approaches suffer from catastrophic reward forgetting (CRF), where the updated reward model overfits to the new preferences. We show that our method can efficiently and effectively adjust robotic behavior to human preferences across simulation benchmark tasks and multiple real-world robotic tasks.
arXiv Detail & Related papers (2025-04-14T09:04:14Z) - PILAF: Optimal Human Preference Sampling for Reward Modeling [14.336058926701432]
We propose Policy-Interpolated Learning for Aligned Feedback (PILAF), a novel response sampling strategy for preference labeling. PILAF explicitly aligns preference learning with maximizing the underlying oracle reward.
arXiv Detail & Related papers (2025-02-06T18:09:00Z) - OpenCharacter: Training Customizable Role-Playing LLMs with Large-Scale Synthetic Personas [65.83634577897564]
This study explores a large-scale data synthesis approach to equip large language models with character generalization capabilities. We begin by synthesizing large-scale character profiles using personas from Persona Hub. We then explore two strategies: response rewriting and response generation, to create character-aligned instructional responses.
arXiv Detail & Related papers (2025-01-26T07:07:01Z) - Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment [51.14207112118503]
We introduce preference embedding, an approach that embeds responses into a latent space to capture preferences efficiently. We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Our method may enhance the alignment of foundation models with nuanced human values.
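As a rough sketch of how a preference embedding can go beyond a scalar Bradley-Terry reward, the toy model below scores a pair of response embeddings with a learned antisymmetric bilinear form, so score(a, b) = -score(b, a) and cyclic (intransitive) preferences become representable. The class and loss names are hypothetical; this is an assumed instantiation, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreferenceEmbeddingScorer(nn.Module):
    """Scores a pair of response embeddings with an antisymmetric bilinear form.

    Antisymmetry guarantees score(a, b) == -score(b, a), which lets the model
    express preference cycles that a single scalar reward cannot.
    """

    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim) * 0.01)

    def forward(self, emb_a, emb_b):
        # Using only the skew-symmetric part keeps the form antisymmetric.
        skew = self.weight - self.weight.t()
        return torch.einsum("bi,ij,bj->b", emb_a, skew, emb_b)

def preference_loss(scorer, emb_chosen, emb_rejected):
    # Logistic loss: push score(chosen, rejected) to be positive.
    return -F.logsigmoid(scorer(emb_chosen, emb_rejected)).mean()

# Toy usage with random "response embeddings" standing in for encoder outputs.
scorer = PreferenceEmbeddingScorer(dim=16)
chosen, rejected = torch.randn(4, 16), torch.randn(4, 16)
print(preference_loss(scorer, chosen, rejected).item())
```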
arXiv Detail & Related papers (2024-10-03T04:22:55Z) - Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning [12.742158403867002]
Reinforcement Learning from Human Feedback is a powerful paradigm for aligning foundation models to human values and preferences.
Current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population.
We develop a class of multimodal RLHF methods to address the need for pluralistic alignment.
arXiv Detail & Related papers (2024-08-19T15:18:30Z) - Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z) - TransAct: Transformer-based Realtime User Action Model for Recommendation at Pinterest [17.247452803197362]
This paper presents Pinterest's ranking architecture for Homefeed.
We propose TransAct, a sequential model that extracts users' short-term preferences from their realtime activities.
We describe the results of ablation studies, the challenges we faced during productionization, and the outcome of an online A/B experiment.
arXiv Detail & Related papers (2023-05-31T23:45:29Z) - Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning [53.92465205531759]
Controlled automated story generation seeks to generate natural language stories satisfying constraints from natural language critiques or preferences.
We train a contrastive bi-encoder model to align stories with human critiques, building a general purpose preference model.
We further fine-tune the contrastive reward model using a prompt-learning technique to increase story generation robustness.
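A contrastive bi-encoder of this kind is commonly trained with an in-batch InfoNCE objective over (story, critique) pairs; the snippet below sketches that setup under the assumption of one matching critique per story, with hypothetical function names rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def bi_encoder_contrastive_loss(story_emb, critique_emb, temperature=0.07):
    """In-batch contrastive loss aligning story embeddings with the embeddings
    of the critiques they satisfy (hypothetical pairing scheme).

    story_emb, critique_emb: [batch, dim] outputs of two separate encoders;
    row i of each tensor is assumed to be a matching (story, critique) pair.
    """
    story = F.normalize(story_emb, dim=-1)
    critique = F.normalize(critique_emb, dim=-1)
    logits = story @ critique.t() / temperature   # cosine similarities
    targets = torch.arange(logits.size(0))        # matching pairs sit on the diagonal
    # Symmetric InfoNCE: stories must retrieve their critiques and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for encoder outputs.
s, c = torch.randn(8, 32), torch.randn(8, 32)
print(bi_encoder_contrastive_loss(s, c).item())
```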
arXiv Detail & Related papers (2022-10-14T13:21:33Z) - Leveraging Historical Interaction Data for Improving Conversational Recommender System [105.90963882850265]
We propose a novel pre-training approach to integrate item- and attribute-based preference sequence.
Experiment results on two real-world datasets have demonstrated the effectiveness of our approach.
arXiv Detail & Related papers (2020-08-19T03:43:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.