Related papers: SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents

URL: http://arxiv.org/abs/2310.11667v2
Date: Fri, 22 Mar 2024 18:52:15 GMT
Title: SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents
Authors: Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, Maarten Sap
Abstract summary: We present SOTOPIA, an open-ended environment to simulate complex social interactions between artificial agents and humans. In our environment, agents role-play and interact under a wide variety of scenarios; they coordinate, collaborate, exchange, and compete with each other to achieve complex social goals. We find that GPT-4 achieves a significantly lower goal completion rate than humans and struggles to exhibit social commonsense reasoning and strategic communication skills.
Score: 107.4138224020773
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Humans are social beings; we pursue social goals in our daily interactions, which is a crucial aspect of social intelligence. Yet, AI systems' abilities in this realm remain elusive. We present SOTOPIA, an open-ended environment to simulate complex social interactions between artificial agents and evaluate their social intelligence. In our environment, agents role-play and interact under a wide variety of scenarios; they coordinate, collaborate, exchange, and compete with each other to achieve complex social goals. We simulate the role-play interaction between LLM-based agents and humans within this task space and evaluate their performance with a holistic evaluation framework called SOTOPIA-Eval. With SOTOPIA, we find significant differences between these models in terms of their social intelligence, and we identify a subset of SOTOPIA scenarios, SOTOPIA-hard, that is generally challenging for all models. We find that on this subset, GPT-4 achieves a significantly lower goal completion rate than humans and struggles to exhibit social commonsense reasoning and strategic communication skills. These findings demonstrate SOTOPIA's promise as a general platform for research on evaluating and improving social intelligence in artificial agents.

Related papers

One Model, All Roles: Multi-Turn, Multi-Agent Self-Play Reinforcement Learning for Conversational Social Intelligence [25.89075578734277]
This paper introduces OMAR: One Model, All Roles, a reinforcement learning framework for AI. OMAR allows a single model to role-play all participants in a conversation simultaneously, learning to achieve long-term goals and complex social norms. We show that trained models develop fine-grained, emergent social intelligence, such as empathy, persuasion, and compromise seeking.
arXiv Detail & Related papers (2026-02-03T05:09:49Z)
S$^3$IT: A Benchmark for Spatially Situated Social Intelligence Test [26.79990069295221]
We introduce the Spatially Situated Social Intelligence Test (S$^3$IT), a benchmark specifically designed to evaluate embodied social intelligence. It is centered on a novel and challenging seat-ordering task, requiring an agent to arrange seating in a 3D environment for a group of large language model-driven NPCs. Our framework generates a vast and diverse scenario space with controllable difficulty, compelling the agent to acquire preferences through active dialogue, perceive the environment via autonomous exploration, and perform multi-objective optimization within a complex constraint network.
arXiv Detail & Related papers (2025-12-23T02:36:56Z)
LIFELONG SOTOPIA: Evaluating Social Intelligence of Language Agents Over Lifelong Social Interactions [4.819825467587802]
We present a novel benchmark, LIFELONG-SOTOPIA, to perform a comprehensive evaluation of language agents. We find that goal achievement and believability of all of the language models that we test decline through the whole interaction. These findings show that we can use LIFELONG-SOTOPIA to evaluate the social intelligence of language agents over lifelong social interactions.
arXiv Detail & Related papers (2025-06-14T23:57:54Z)
The Coming Crisis of Multi-Agent Misalignment: AI Alignment Must Be a Dynamic and Social Process [13.959658276224266]
AI alignment with human values and preferences remains a core challenge. As agents engage with one another, they must coordinate to accomplish both individual and collective goals. Social structure can deter or shatter group and individual values. We call on the AI community to treat human, preferential, and objective alignment as an interdependent concept.
arXiv Detail & Related papers (2025-06-01T16:39:43Z)
The Human Robot Social Interaction (HSRI) Dataset: Benchmarking Foundational Models' Social Reasoning [49.32390524168273]
Our work aims to advance the social reasoning of embodied artificial intelligence (AI) agents in real-world social interactions. We introduce a large-scale real-world Human Robot Social Interaction (HSRI) dataset to benchmark the capabilities of language models (LMs) and foundational models (FMs). Our dataset consists of 400 real-world human social robot interaction videos and over 10K annotations, detailing the robot's social errors, competencies, rationale, and corrective actions.
arXiv Detail & Related papers (2025-04-07T06:27:02Z)
AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios [38.878966229688054]
We introduce AgentSense: Benchmarking Social Intelligence of Language Agents through Interactive Scenarios. Drawing on Dramaturgical Theory, AgentSense employs a bottom-up approach to create 1,225 diverse social scenarios constructed from extensive scripts. We analyze goals using ERG theory and conduct comprehensive experiments. Our findings highlight that LLMs struggle with goals in complex social scenarios, especially high-level growth needs, and even GPT-4o requires improvement in private information reasoning.
arXiv Detail & Related papers (2024-10-25T07:04:16Z)
Advancing Social Intelligence in AI Agents: Technical Challenges and Open Questions [67.60397632819202]
Building socially-intelligent AI agents (Social-AI) is a multidisciplinary, multimodal research goal. We identify a set of underlying technical challenges and open questions for researchers across computing communities to advance Social-AI.
arXiv Detail & Related papers (2024-04-17T02:57:42Z)
SocialBench: Sociality Evaluation of Role-Playing Conversational Agents [85.6641890712617]
Large language models (LLMs) have advanced the development of various AI conversational agents. SocialBench is the first benchmark designed to evaluate the sociality of role-playing conversational agents at both individual and group levels. We find that excelling at the individual level does not imply proficiency at the group level.
arXiv Detail & Related papers (2024-03-20T15:38:36Z)
SOTOPIA-$π$: Interactive Learning of Socially Intelligent Language Agents [73.35393511272791]
We propose an interactive learning method, SOTOPIA-$π$, improving the social intelligence of language agents. This method leverages behavior cloning and self-reinforcement training on filtered social interaction data according to large language model (LLM) ratings.
arXiv Detail & Related papers (2024-03-13T17:17:48Z)
SocialAI: Benchmarking Socio-Cognitive Abilities in Deep Reinforcement Learning Agents [23.719833581321033]
Building embodied autonomous agents capable of participating in social interactions with humans is one of the main challenges in AI. We argue that aiming towards human-level AI requires a broader set of key social skills. We present SocialAI, a benchmark to assess the acquisition of social skills of DRL agents.
arXiv Detail & Related papers (2021-07-02T10:39:18Z)
PHASE: PHysically-grounded Abstract Social Events for Machine Social Perception [50.551003004553806]
We create a dataset of physically-grounded abstract social events, PHASE, that resemble a wide range of real-life social interactions. PHASE is validated with human experiments demonstrating that humans perceive rich interactions in the social events. As a baseline model, we introduce a Bayesian inverse planning approach, SIMPLE, which outperforms state-of-the-art feed-forward neural networks.
arXiv Detail & Related papers (2021-03-02T18:44:57Z)
Watch-And-Help: A Challenge for Social Perception and Human-AI Collaboration [116.28433607265573]
We introduce Watch-And-Help (WAH), a challenge for testing social intelligence in AI agents. In WAH, an AI agent needs to help a human-like agent perform a complex household task efficiently. We build VirtualHome-Social, a multi-agent household environment, and provide a benchmark including both planning and learning based baselines.
arXiv Detail & Related papers (2020-10-19T21:48:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.