How Social is It? A Benchmark for LLMs' Capabilities in Multi-user Multi-turn Social Agent Tasks
- URL: http://arxiv.org/abs/2505.04628v1
- Date: Fri, 04 Apr 2025 08:59:01 GMT
- Title: How Social is It? A Benchmark for LLMs' Capabilities in Multi-user Multi-turn Social Agent Tasks
- Authors: Yusen Wu, Junwu Xiong, Xiaotie Deng
- Abstract summary: Large language models (LLMs) play roles in multi-user, multi-turn social agent tasks. We propose a novel benchmark, How Social Is It (HSII), designed to assess LLMs' social capabilities. HSII comprises four stages: format parsing, target selection, target switching conversation, and stable conversation, which collectively evaluate the communication and task completion capabilities of LLMs.
- Score: 6.487500253901779
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Expanding the application of large language models (LLMs) to societal life, rather than confining them to their primary function as auxiliary assistants that communicate with one person at a time, requires that LLMs can independently play roles in multi-user, multi-turn social agent tasks within complex social settings. However, this capability has not been systematically measured by existing benchmarks. To address this gap, we first introduce an agent task leveling framework grounded in sociological principles. Concurrently, we propose a novel benchmark, How Social Is It (HSII), designed to assess LLMs' social capabilities on comprehensive social agent tasks and to benchmark representative models. HSII comprises four stages: format parsing, target selection, target switching conversation, and stable conversation, which collectively evaluate the communication and task completion capabilities of LLMs on a dataset of realistic social interaction scenarios, HSII-Dataset, derived step by step from a news dataset. We perform an ablation study by clustering the dataset. Additionally, we investigate the impact of the chain-of-thought (CoT) method on LLMs' social performance. Since CoT incurs extra computation, we further introduce a new statistical metric, COT-complexity, to quantify the efficiency of a given LLM with CoT on specific social tasks and strike a better trade-off between correctness and efficiency. Our experimental results demonstrate that the benchmark is well-suited for evaluating social skills in LLMs.
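The four-stage protocol and the correctness-versus-compute trade-off suggest a simple evaluation harness. Below is a minimal Python sketch under stated assumptions: `query_model` is a hypothetical LLM client, and the accuracy-per-token ratio stands in for the paper's COT-complexity metric, whose exact definition is not given in this listing.

```python
from dataclasses import dataclass
from typing import Callable, List

# The four HSII stages named in the abstract, run in order.
STAGES = ["format_parsing", "target_selection",
          "target_switching_conversation", "stable_conversation"]

@dataclass
class StageResult:
    stage: str
    correct: bool     # did the model satisfy this stage's check?
    tokens_used: int  # generation cost, a proxy for CoT overhead

def run_hsii_episode(query_model: Callable[[str, bool], tuple],
                     scenario: dict, use_cot: bool) -> List[StageResult]:
    """Run one multi-user scenario through the four stages in order.

    `query_model(prompt, use_cot)` is a hypothetical client returning
    (passed: bool, tokens: int); it stands in for any LLM API.
    """
    results = []
    for stage in STAGES:
        passed, tokens = query_model(scenario[stage], use_cot)
        results.append(StageResult(stage, passed, tokens))
        if not passed:  # later stages depend on earlier ones
            break
    return results

def cot_complexity(results: List[StageResult]) -> float:
    """Illustrative efficiency ratio: stage accuracy per 1k tokens.

    An assumed stand-in for the paper's COT-complexity metric, which
    trades off correctness against CoT computation cost.
    """
    total_tokens = sum(r.tokens_used for r in results)
    accuracy = sum(r.correct for r in results) / len(STAGES)
    return accuracy / max(total_tokens / 1000, 1e-9)
```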
Related papers
- SocialEval: Evaluating Social Intelligence of Large Language Models [70.90981021629021]
Social Intelligence (SI) equips humans with the interpersonal abilities to behave wisely when navigating social interactions toward social goals. This suggests an operational evaluation paradigm: outcome-oriented goal achievement evaluation and process-oriented interpersonal ability evaluation. We propose SocialEval, a script-based bilingual SI benchmark that integrates outcome- and process-oriented evaluation through manually crafted narrative scripts.
arXiv Detail & Related papers (2025-06-01T08:36:51Z) - SocialJax: An Evaluation Suite for Multi-agent Reinforcement Learning in Sequential Social Dilemmas [3.897833712166508]
Sequential social dilemmas pose a significant challenge in the field of multi-agent reinforcement learning. SocialJax is a suite of sequential social dilemma environments and algorithms implemented in JAX, a high-performance numerical computing library for Python.
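To illustrate why JAX fits vectorized multi-agent rollouts, here is a minimal pure-JAX sketch. It does not use SocialJax's actual API (which this listing does not show); it jit-compiles a toy matrix-game payoff and vmaps it across a batch of agent pairs.

```python
import jax
import jax.numpy as jnp

# Toy 2x2 payoff matrix for a social dilemma (cooperate=0, defect=1).
# PAYOFF[a, b] is the row player's reward when playing a against b.
PAYOFF = jnp.array([[3.0, 0.0],
                    [5.0, 1.0]])

def payoff(action_a: jnp.ndarray, action_b: jnp.ndarray) -> jnp.ndarray:
    """Reward for agent A given both agents' discrete actions."""
    return PAYOFF[action_a, action_b]

# vmap vectorizes over a whole batch of simultaneous matchups;
# jit compiles the batched function with XLA for speed.
batched_payoff = jax.jit(jax.vmap(payoff))

key = jax.random.PRNGKey(0)
ka, kb = jax.random.split(key)
acts_a = jax.random.bernoulli(ka, 0.5, (1024,)).astype(jnp.int32)
acts_b = jax.random.bernoulli(kb, 0.5, (1024,)).astype(jnp.int32)
rewards = batched_payoff(acts_a, acts_b)  # shape (1024,)
print(rewards.mean())
```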
arXiv Detail & Related papers (2025-03-18T16:03:59Z) - Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z) - Can LLMs Simulate Social Media Engagement? A Study on Action-Guided Response Generation [51.44040615856536]
This paper analyzes large language models' ability to simulate social media engagement through action-guided response generation. We benchmark GPT-4o-mini, O1-mini, and DeepSeek-R1 on social media engagement simulation for a major societal event.
arXiv Detail & Related papers (2025-02-17T17:43:08Z) - Social Debiasing for Fair Multi-modal LLMs [55.8071045346024]
Multi-modal Large Language Models (MLLMs) have advanced significantly, offering powerful vision-language understanding capabilities.
However, these models often inherit severe social biases from their training datasets, leading to unfair predictions based on attributes like race and gender.
This paper addresses the issue of social biases in MLLMs by i) introducing a comprehensive Counterfactual dataset with Multiple Social Concepts (CMSC) and ii) proposing an Anti-Stereotype Debiasing strategy (ASD).
arXiv Detail & Related papers (2024-08-13T02:08:32Z) - SOTOPIA-$π$: Interactive Learning of Socially Intelligent Language Agents [73.35393511272791]
We propose an interactive learning method, SOTOPIA-$pi$, improving the social intelligence of language agents.
This method leverages behavior cloning and self-reinforcement training on filtered social interaction data according to large language model (LLM) ratings.
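A minimal sketch of the data-filtering side of such training, assuming a hypothetical `rate_with_llm` judge and a `fine_tune` hook (neither is SOTOPIA-$\pi$'s actual code): keep only interactions the LLM judge rates above a threshold and use them as behavior cloning targets.

```python
from typing import Callable, Dict, List

def filter_for_behavior_cloning(
    interactions: List[Dict],               # each: {"dialogue": str, ...}
    rate_with_llm: Callable[[str], float],  # hypothetical LLM judge, returns a score
    threshold: float = 7.0,                 # keep episodes rated >= threshold (of 10)
) -> List[Dict]:
    """Self-reinforcement-style data filtering: an LLM rates each social
    interaction, and only highly rated episodes become training targets."""
    return [ep for ep in interactions
            if rate_with_llm(ep["dialogue"]) >= threshold]

# Usage sketch: the filtered episodes feed a standard supervised
# fine-tuning step (behavior cloning on the agent's own good turns).
# filtered = filter_for_behavior_cloning(episodes, rate_with_llm)
# fine_tune(policy_model, filtered)  # hypothetical training hook
```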
arXiv Detail & Related papers (2024-03-13T17:17:48Z) - MetaAgents: Simulating Interactions of Human Behaviors for LLM-based
Task-oriented Coordination via Collaborative Generative Agents [27.911816995891726]
We introduce collaborative generative agents, endowing LLM-based agents with consistent behavior patterns and task-solving abilities.
We propose a novel framework that equips collaborative generative agents with human-like reasoning abilities and specialized skills.
Our work provides valuable insights into the role and evolution of Large Language Models in task-oriented social simulations.
arXiv Detail & Related papers (2023-10-10T10:17:58Z) - Exploring Collaboration Mechanisms for LLM Agents: A Social Psychology View [60.80731090755224]
This paper probes the collaboration mechanisms among contemporary NLP systems through practical experiments combined with theoretical insights.
We fabricate four unique "societies" composed of LLM agents, where each agent is characterized by a specific "trait" (easy-going or overconfident) and engages in collaboration with a distinct "thinking pattern" (debate or reflection).
Our results further illustrate that LLM agents manifest human-like social behaviors, such as conformity and consensus reaching, mirroring social psychology theories.
arXiv Detail & Related papers (2023-10-03T15:05:52Z) - Do LLMs Understand Social Knowledge? Evaluating the Sociability of Large
Language Models with SocKET Benchmark [14.922083834969323]
Large language models (LLMs) have been shown to perform well at a variety of syntactic, discourse, and reasoning tasks.
We introduce a new theory-driven benchmark, SocKET, that contains 58 NLP tasks testing social knowledge.
arXiv Detail & Related papers (2023-05-24T09:21:06Z)