Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
- URL: http://arxiv.org/abs/2505.02847v3
- Date: Wed, 21 May 2025 13:45:40 GMT
- Title: Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
- Authors: Bang Zhang, Ruotian Ma, Qingxuan Jiang, Peisong Wang, Jiaqi Chen, Zheng Xie, Xingyu Chen, Yue Wang, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li
- Abstract summary: Sentient Agent as a Judge (SAGE) is an evaluation framework for large language models. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction. SAGE provides a principled, scalable and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
- Score: 75.85319609088354
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Assessing how well a large language model (LLM) understands a human, rather than merely text, remains an open challenge. To bridge the gap, we introduce Sentient Agent as a Judge (SAGE), an automated evaluation framework that measures an LLM's higher-order social cognition. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction, providing a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts. Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, validating psychological fidelity. We also build a public Sentient Leaderboard covering 18 commercial and open-source models that uncovers substantial gaps (up to 4x) between frontier systems (GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines, gaps not reflected in conventional leaderboards (e.g., Arena). SAGE thus provides a principled, scalable and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
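As a rough illustration of the per-turn protocol described in the abstract, here is a minimal sketch of a SAGE-style interaction loop. The prompt wording, the 0-100 emotion scale, and the `llm` callable are illustrative assumptions, not the authors' released implementation.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class SentientState:
    """Running state of the simulated human (the Sentient Agent)."""
    emotion: int = 60                      # assumed 0-100 emotion score
    trajectory: List[int] = field(default_factory=list)
    inner_thoughts: List[str] = field(default_factory=list)

def sentient_turn(llm: Callable[[str], str], state: SentientState,
                  persona: str, assistant_reply: str) -> str:
    """One turn: reason about (i) emotion change, (ii) inner feeling, (iii) the reply."""
    prompt = (
        f"You are role-playing this person: {persona}\n"
        f"Your current emotion score (0-100): {state.emotion}\n"
        f"The assistant just said: {assistant_reply}\n"
        'Answer in JSON: {"emotion_change": int, "inner_thought": str, "reply": str}'
    )
    out = json.loads(llm(prompt))
    state.emotion = max(0, min(100, state.emotion + int(out["emotion_change"])))
    state.trajectory.append(state.emotion)              # numerical emotion trajectory
    state.inner_thoughts.append(out["inner_thought"])   # interpretable inner thoughts
    return out["reply"]
```

In this sketch the tested model converses with the agent for a fixed number of turns; the final value of `state.emotion` would play the role of the Sentient emotion score, and `state.inner_thoughts` gives the interpretable trace.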
Related papers
- RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents [67.46032287312339]
Large language models (LLMs) excel at logical and algorithmic reasoning, yet their emotional intelligence (EQ) still lags far behind their cognitive prowess. We introduce RLVER, the first end-to-end reinforcement learning framework that leverages verifiable emotion rewards from simulated users. Our results show that RLVER is a practical route toward emotionally intelligent and broadly capable language agents.
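A hedged sketch of what a verifiable emotion reward can look like (the function name and the 0-100 scale are assumptions, not RLVER's exact formulation): the simulated user's emotion score before and after the policy's reply yields a scalar reward.

```python
def emotion_reward(emotion_before: int, emotion_after: int) -> float:
    """Verifiable reward signal: change in the simulated user's emotion score,
    rescaled from an assumed 0-100 scale to roughly [-1, 1]."""
    return (emotion_after - emotion_before) / 100.0
```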
arXiv Detail & Related papers (2025-07-03T18:33:18Z)
- Are You Listening to Me? Fine-Tuning Chatbots for Empathetic Dialogue [0.5849783371898033]
We explore how Large Language Models (LLMs) respond when tasked with generating emotionally rich interactions. We analyze the emotional progression of the dialogues using both sentiment analysis (via VADER) and expert assessments.
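For reference, VADER-based tracking of emotional progression can look roughly like the sketch below; the example turns are placeholders, and the paper's exact pipeline may differ.

```python
# pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
user_turns = [
    "I don't know, everything feels pointless lately.",
    "Talking about it actually helps a little.",
    "Thanks, I feel a bit lighter now.",
]
# VADER's compound score lies in [-1, 1]; the per-turn sequence
# approximates the dialogue's emotional progression.
progression = [analyzer.polarity_scores(turn)["compound"] for turn in user_turns]
print(progression)  # expected to trend upward for this example
```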
arXiv Detail & Related papers (2025-07-03T11:32:41Z)
- MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems [20.58639538648743]
We introduce MetaMind, a multi-agent framework inspired by psychological theories of metacognition. Our framework achieves state-of-the-art performance across three challenging benchmarks, with a 35.7% improvement in real-world social scenarios. This work advances AI systems toward human-like social intelligence, with applications in empathetic dialogue and culturally sensitive interactions.
arXiv Detail & Related papers (2025-05-25T02:32:57Z)
- Personality-affected Emotion Generation in Dialog Systems [67.40609683389947]
We propose a new task, Personality-affected Emotion Generation, to generate emotion based on the personality given to the dialog system.
We analyze the challenges in this task, i.e., (1) heterogeneously integrating personality and emotional factors and (2) extracting multi-granularity emotional information in the dialog context.
Results suggest that by adopting our method, the emotion generation performance is improved by 13% in macro-F1 and 5% in weighted-F1 from the BERT-base model.
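Macro-F1 and weighted-F1 are the standard multi-class variants of the F1 score; with scikit-learn they can be computed as in this small sketch (the labels are placeholders).

```python
from sklearn.metrics import f1_score

y_true = ["joy", "sadness", "anger", "joy", "neutral"]
y_pred = ["joy", "neutral", "anger", "joy", "sadness"]

macro_f1 = f1_score(y_true, y_pred, average="macro")        # unweighted mean over emotion classes
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # mean weighted by class support
```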
arXiv Detail & Related papers (2024-04-03T08:48:50Z)
- SocialBench: Sociality Evaluation of Role-Playing Conversational Agents [85.6641890712617]
Large language models (LLMs) have advanced the development of various AI conversational agents.
SocialBench is the first benchmark designed to evaluate the sociality of role-playing conversational agents at both individual and group levels.
We find that agents excelling at the individual level do not necessarily show proficiency at the group level.
arXiv Detail & Related papers (2024-03-20T15:38:36Z)
- Can Generative Agents Predict Emotion? [0.0]
Large Language Models (LLMs) have demonstrated a number of human-like abilities; however, the empathic understanding and emotional state of LLMs have yet to be aligned with those of humans.
We investigate how the emotional state of generative LLM agents evolves as they perceive new events, introducing a novel architecture in which new experiences are compared to past memories.
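One generic way to realize "comparing new experiences to past memories" is embedding similarity; the sketch below shows that general pattern, not the paper's specific architecture, and the vectors are assumed to come from any sentence encoder.

```python
import numpy as np
from typing import List

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def recall_memories(new_event: np.ndarray, memories: List[np.ndarray], k: int = 3) -> List[int]:
    """Return indices of the k past memories most similar to the new event embedding."""
    sims = [cosine(new_event, m) for m in memories]
    return sorted(range(len(sims)), key=sims.__getitem__, reverse=True)[:k]
```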
arXiv Detail & Related papers (2024-02-06T18:39:43Z)
- Rational Sensibility: LLM Enhanced Empathetic Response Generation Guided by Self-presentation Theory [8.439724621886779]
The development of Large Language Models (LLMs) offers a glimmer of hope for human-centered Artificial General Intelligence (AGI).
Empathy serves as a key emotional attribute of humanity, playing an irreplaceable role in human-centered AGI.
In this paper, we design an innovative encoder module inspired by self-presentation theory in sociology, which specifically processes sensibility and rationality sentences in dialogues.
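A generic sketch of the split-then-encode idea (routing rationality- and sensibility-tagged sentences to separate encoders and fusing the results); the module below is a placeholder architecture, not the paper's actual encoder.

```python
import torch
import torch.nn as nn

class DualPathEncoder(nn.Module):
    """Encodes rationality and sensibility sentence embeddings on separate paths, then fuses them."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.rational_enc = nn.GRU(dim, dim, batch_first=True)
        self.sensible_enc = nn.GRU(dim, dim, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, rational_seq: torch.Tensor, sensible_seq: torch.Tensor) -> torch.Tensor:
        # Each input: (batch, num_sentences, dim) sentence embeddings.
        _, h_r = self.rational_enc(rational_seq)
        _, h_s = self.sensible_enc(sensible_seq)
        return self.fuse(torch.cat([h_r[-1], h_s[-1]], dim=-1))
```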
arXiv Detail & Related papers (2023-12-14T07:38:12Z)
- Emotionally Numb or Empathetic? Evaluating How LLMs Feel Using EmotionBench [83.41621219298489]
We evaluate Large Language Models' (LLMs) anthropomorphic capabilities using the emotion appraisal theory from psychology.
We collect a dataset containing over 400 situations that have proven effective in eliciting the eight emotions central to our study.
We conduct a human evaluation involving more than 1,200 subjects worldwide.
arXiv Detail & Related papers (2023-08-07T15:18:30Z)
- CASE: Aligning Coarse-to-Fine Cognition and Affection for Empathetic Response Generation [59.8935454665427]
Empathetic dialogue models usually consider only the affective aspect or treat cognition and affection in isolation.
We propose the CASE model for empathetic dialogue generation.
arXiv Detail & Related papers (2022-08-18T14:28:38Z)
- Constructing Emotion Consensus and Utilizing Unpaired Data for Empathetic Dialogue Generation [22.2430593119389]
We propose a dual-generative model, Dual-Emp, that simultaneously constructs an emotion consensus and utilizes external unpaired data.
Our method outperforms competitive baselines in producing coherent and empathetic responses.
arXiv Detail & Related papers (2021-09-16T07:57:01Z)
- Towards Socially Intelligent Agents with Mental State Transition and Human Utility [97.01430011496576]
We propose to incorporate a mental state and utility model into dialogue agents.
The hybrid mental state extracts information from both the dialogue and event observations.
The utility model is a ranking model that learns human preferences from a crowd-sourced social commonsense dataset.
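Ranking models of this kind are commonly trained with a pairwise preference objective; the following is a generic sketch (a Bradley-Terry style loss over response scores), not necessarily the paper's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UtilityRanker(nn.Module):
    """Scores a candidate response embedding; preferred responses should score higher."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)

def pairwise_preference_loss(score_preferred: torch.Tensor,
                             score_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style: push the preferred response's score above the rejected one's.
    return -F.logsigmoid(score_preferred - score_rejected).mean()
```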
arXiv Detail & Related papers (2021-03-12T00:06:51Z)