MERRY: Semantically Decoupled Evaluation of Multimodal Emotional and Role Consistencies of Role-Playing Agents
- URL: http://arxiv.org/abs/2602.21941v1
- Date: Tue, 24 Feb 2026 02:53:58 GMT
- Title: MERRY: Semantically Decoupled Evaluation of Multimodal Emotional and Role Consistencies of Role-Playing Agents
- Authors: Zhenyu Wang, Xiaofen Xing, Yirong Chen, Xiangmin Xu
- Abstract summary: MERRY is a semantically decoupled evaluation framework for assessing Multimodal Emotional and Role consistencies of Role-playing agents. We transform the traditional subjective scoring approach into a novel bidirectional-evidence-finding task. We conduct extensive evaluations based on MERRY.
- Score: 41.829135334587626
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal Role-Playing Agents (MRPAs) are attracting increasing attention due to their ability to deliver more immersive multimodal emotional interactions. However, existing studies still rely on purely textual benchmarks to evaluate the text responses of MRPAs, while delegating the assessment of their multimodal expressions solely to modality-synthesis metrics. This evaluation paradigm, on the one hand, entangles semantic assessment with modality generation, leading to ambiguous error attribution, and, on the other hand, remains constrained by heavy reliance on human judgment. To this end, we propose MERRY, a semantically decoupled evaluation framework for assessing Multimodal Emotional and Role consistencies of Role-playing agents. The framework introduces five refined metrics for emotional consistency (EC) and three for role consistency (RC). Notably, we transform the traditional subjective scoring approach into a novel bidirectional-evidence-finding task, significantly improving the human agreement of LLM-as-Judge evaluations. Based on MERRY, we conduct extensive evaluations. Our empirical results primarily reveal that: (1) training on synthetic datasets tends to reduce emotional consistency, whereas training on real-world datasets improves it; (2) existing models suffer from emotional templatization and simplification, exhibiting a positive bias and a performance bottleneck on fine-grained negative emotions; (3) a simple prompting method strengthens weak models but constrains strong ones, while a simple fine-tuning method suffers from poor role generalization. Code and dataset are available.
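The abstract does not spell out the evidence-finding protocol. As a minimal sketch, reading "bidirectional" as collecting evidence both for and against consistency (an interpretation, not the paper's definition) and assuming a generic chat-completion client (`ask_llm` is a hypothetical callable, not MERRY's actual interface), an evidence-finding judge could look like this:

```python
import json
from typing import Callable

EVIDENCE_PROMPT = """\
You are judging a role-playing agent's reply for role consistency.
Persona profile:
{persona}

Agent reply:
{reply}

Do NOT output a score. Instead, cite evidence in both directions as JSON:
{{"supporting": ["<reply span that matches the persona>", ...],
  "violating": ["<reply span that contradicts the persona>", ...]}}
"""

def judge_role_consistency(persona: str, reply: str,
                           ask_llm: Callable[[str], str]) -> float:
    """Derive the score from cited evidence rather than a raw subjective
    rating; cited spans are easier to verify against human judgment."""
    raw = ask_llm(EVIDENCE_PROMPT.format(persona=persona, reply=reply))
    evidence = json.loads(raw)
    support = len(evidence.get("supporting", []))
    violate = len(evidence.get("violating", []))
    if support + violate == 0:
        return 0.5  # no evidence either way: treat as neutral
    return support / (support + violate)
```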
Related papers
- Evaluating Zero-Shot and One-Shot Adaptation of Small Language Models in Leader-Follower Interaction [1.3511057160494195]
Leader-follower interaction is an important paradigm in human-robot interaction (HRI). Small language models (SLMs) offer a potential alternative, but their effectiveness for role classification in HRI has not been systematically evaluated.
arXiv Detail & Related papers (2026-02-26T18:20:26Z) - Reinforcing Trustworthiness in Multimodal Emotional Support Systems [19.59836948857841]
Multimodal approaches to emotional support show great promise by integrating diverse data sources to provide empathetic, contextually relevant responses. We introduce MultiMood, a new framework that leverages multimodal embeddings from video, audio, and text to predict emotional components and to produce responses aligned with professional therapeutic standards.
arXiv Detail & Related papers (2025-11-13T06:28:07Z) - Speech-DRAME: A Framework for Human-Aligned Benchmarks in Speech Role-Play [68.54773980519457]
Speech-DRAME is a unified framework that contributes at three levels. It provides the first comprehensive, reproducible foundation for assessing spoken role-play.
arXiv Detail & Related papers (2025-11-03T06:12:40Z) - Consistently Simulating Human Personas with Multi-Turn Reinforcement Learning [52.07170679746533]
Large Language Models (LLMs) are increasingly used to simulate human users in interactive settings such as therapy, education, and social role-play. We introduce a unified framework for evaluating and improving persona consistency in LLM-generated dialogue. We define three automatic metrics (prompt-to-line consistency, line-to-line consistency, and Q&A consistency) that capture different types of persona drift, and validate each against human annotations.
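The blurb does not define how these metrics are computed; one plausible embedding-based reading, sketched with sentence-transformers (an illustrative assumption, not necessarily the authors' implementation):

```python
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here

def prompt_to_line_consistency(persona_prompt: str, lines: list[str]) -> float:
    """Mean cosine similarity between the persona prompt and each generated line."""
    prompt_emb = model.encode(persona_prompt, convert_to_tensor=True)
    line_embs = model.encode(lines, convert_to_tensor=True)
    return util.cos_sim(prompt_emb, line_embs).mean().item()

def line_to_line_consistency(lines: list[str]) -> float:
    """Mean pairwise similarity across a speaker's lines; low values suggest drift."""
    if len(lines) < 2:
        return 1.0
    embs = model.encode(lines, convert_to_tensor=True)
    sims = [util.cos_sim(embs[i], embs[j]).item()
            for i, j in combinations(range(len(embs)), 2)]
    return sum(sims) / len(sims)
```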
arXiv Detail & Related papers (2025-10-31T19:40:41Z) - DEBATE: A Large-Scale Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates [10.609797175227644]
We introduce DEBATE, the first large-scale empirical benchmark to evaluate the authenticity of the interaction between multi-agent role-playing LLMs. We systematically evaluate and identify critical discrepancies between simulated and authentic group dynamics.
arXiv Detail & Related papers (2025-10-29T02:21:10Z) - Evaluating LLM Alignment on Personality Inference from Real-World Interview Data [7.061237517845673]
Large Language Models (LLMs) are increasingly deployed in roles requiring nuanced psychological understanding. Their ability to interpret human personality traits, a critical aspect of such applications, remains unexplored. We introduce a novel benchmark comprising semi-structured interview transcripts paired with validated continuous Big Five trait scores.
arXiv Detail & Related papers (2025-09-16T16:54:35Z) - Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators [45.00450861498919]
Flex-Judge is a reasoning-guided multimodal judge model that leverages minimal textual reasoning data. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches.
arXiv Detail & Related papers (2025-05-24T08:50:53Z) - T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation [60.620408007636016]
We propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores. Our approach integrates Group Relative Policy Optimization into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains.
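As a rough illustration of how coarse-grained scores can supply an RL signal, a reward of the following shape could sit inside a GRPO loop (a generic sketch under an assumed 1-to-5 score format, not the paper's actual reward design):

```python
import re

def score_reward(completion: str, gold_score: int, max_score: int = 5) -> float:
    """Reward the judge's completion by how close its final scalar score is
    to the coarse-grained gold score (assumed 1-to-5 scale)."""
    match = re.search(r"score:\s*(\d+)", completion.lower())
    if match is None:
        return -1.0  # unparseable output: format penalty
    pred = int(match.group(1))
    return 1.0 - abs(pred - gold_score) / (max_score - 1)

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO's group-relative step: standardize rewards across the sampled
    completions for one prompt instead of learning a value critic."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    std = std if std > 0 else 1.0
    return [(r - mean) / std for r in rewards]
```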
arXiv Detail & Related papers (2025-05-23T13:44:59Z) - Emphasising Structured Information: Integrating Abstract Meaning Representation into LLMs for Enhanced Open-Domain Dialogue Evaluation [31.633351104278194]
Our framework integrates AMR graph information through a gating mechanism for enhanced semantic representation learning. It achieves strong correlations with human judgments across multiple datasets, establishing a new benchmark for dialogue evaluation.
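The blurb names a gating mechanism without detailing it; a common gated-fusion formulation that fits the description, written in PyTorch as an illustrative assumption about the architecture:

```python
import torch
import torch.nn as nn

class GatedAMRFusion(nn.Module):
    """Fuse an AMR-graph embedding with a text embedding via a learned gate:
    g = sigmoid(W [text; amr]),  fused = g * text + (1 - g) * amr."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_emb: torch.Tensor, amr_emb: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([text_emb, amr_emb], dim=-1)))
        return g * text_emb + (1.0 - g) * amr_emb
```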
arXiv Detail & Related papers (2024-04-01T14:11:45Z) - AntEval: Evaluation of Social Interaction Competencies in LLM-Driven Agents [65.16893197330589]
Large Language Models (LLMs) have demonstrated their ability to replicate human behaviors across a wide range of scenarios.
However, their capability in handling complex, multi-character social interactions has yet to be fully explored.
We introduce the Multi-Agent Interaction Evaluation Framework (AntEval), encompassing a novel interaction framework and evaluation methods.
arXiv Detail & Related papers (2024-01-12T11:18:00Z) - MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs).
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm to the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA requires only a small amount of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
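For context, off-policy evaluation estimates a target policy's expected return (here, a human-rating signal) from logged interactions alone; a textbook importance-sampling baseline (illustrative only; ENIGMA itself is model-free and avoids this estimator's high variance) looks like:

```python
def importance_sampling_ope(episodes, target_policy, behavior_policy):
    """Per-episode importance-sampling estimate of the target policy's
    average return from logged data.

    episodes: trajectories as lists of (state, action, reward) tuples
    logged under behavior_policy; each *_policy(action, state) returns
    the probability of taking `action` in `state`."""
    estimates = []
    for trajectory in episodes:
        weight, episode_return = 1.0, 0.0
        for state, action, reward in trajectory:
            weight *= target_policy(action, state) / behavior_policy(action, state)
            episode_return += reward
        estimates.append(weight * episode_return)
    return sum(estimates) / len(estimates)
```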
arXiv Detail & Related papers (2021-02-20T03:29:20Z)