Social Caption: Evaluating Social Understanding in Multimodal Models
- URL: http://arxiv.org/abs/2601.14569v1
- Date: Wed, 21 Jan 2026 01:10:42 GMT
- Title: Social Caption: Evaluating Social Understanding in Multimodal Models
- Authors: Bhaavanaa Thumu, Leena Mathur, Youssouf Kebe, Louis-Philippe Morency
- Abstract summary: Social understanding abilities are crucial for multimodal large language models (MLLMs) to interpret human social interactions. We introduce Social Caption, a framework grounded in interaction theory to evaluate social understanding abilities of MLLMs. We analyze factors influencing model performance in social understanding, such as scale, architectural design, and spoken context.
- Score: 23.008965893705767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Social understanding abilities are crucial for multimodal large language models (MLLMs) to interpret human social interactions. We introduce Social Caption, a framework grounded in interaction theory to evaluate social understanding abilities of MLLMs along three dimensions: Social Inference (SI), the ability to make accurate inferences about interactions; Holistic Social Analysis (HSA), the ability to generate comprehensive descriptions of interactions; Directed Social Analysis (DSA), the ability to extract relevant social information from interactions. We analyze factors influencing model performance in social understanding, such as scale, architectural design, and spoken context. Experiments with MLLM judges contribute insights about scaling automated evaluation of multimodal social understanding.
Related papers
- Neural Synchrony Between Socially Interacting Language Models [52.74586779814636]
Large language models (LLMs) are widely accepted as powerful approximations of human behavior. It remains controversial whether they can be meaningfully compared to human social minds.
(2026-02-19T20:33:54Z)
- Social Simulations with Large Language Model Risk Utopian Illusion [61.358959720048354]
We introduce a systematic framework for analyzing large language models' behavior in social simulation. Our approach simulates multi-agent interactions through chatroom-style conversations and analyzes them across five linguistic dimensions. Our findings reveal that LLMs do not faithfully reproduce genuine human behavior but instead reflect overly idealized versions of it.
(2025-10-24T06:08:41Z)
- SocialNLI: A Dialogue-Centric Social Inference Dataset [49.60157928163403]
We introduce SocialNLI -- the first social dialogue inference dataset. SocialNLI consists of a collection of dialogue transcripts hand-picked to center complex social nuances. We evaluate reasoning models' theory-of-mind ability through multi-step counterfactual reasoning.
(2025-10-06T23:42:01Z)
- Social World Models [35.672466808871945]
We introduce a novel structured social world representation formalism (S3AP). S3AP represents social interactions as structured tuples, such as state, observation, agent actions, and mental states. We show S3AP can help LLMs better understand social narratives across 5 social reasoning tasks. We then induce social world models from these structured representations, demonstrating their ability to predict future social dynamics.
(2025-08-30T16:52:58Z)
- SocialEval: Evaluating Social Intelligence of Large Language Models [70.90981021629021]
Social Intelligence (SI) equips humans with interpersonal abilities to behave wisely in navigating social interactions to achieve social goals. This presents an operational evaluation paradigm: outcome-oriented goal achievement evaluation and process-oriented interpersonal ability evaluation. We propose SocialEval, a script-based bilingual SI benchmark, integrating outcome- and process-oriented evaluation by manually crafting narrative scripts.
(2025-06-01T08:36:51Z)
- SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models [41.68365456601248]
We introduce SocialMaze, a new benchmark specifically designed to evaluate social reasoning. SocialMaze systematically incorporates three core challenges: deep reasoning, dynamic interaction, and information uncertainty. It provides six diverse tasks across three key settings: social reasoning games, daily-life interactions, and digital community platforms.
(2025-05-29T17:47:36Z)
- R^3-VQA: "Read the Room" by Video Social Reasoning [26.694917467429207]
"Read the room" is a significant social reasoning capability in human daily life. We contribute a valuable, high-quality, and comprehensive video dataset named R3-VQA.
(2025-05-07T05:55:45Z)
- The Human Robot Social Interaction (HSRI) Dataset: Benchmarking Foundational Models' Social Reasoning [49.32390524168273]
Our work aims to advance the social reasoning of embodied artificial intelligence (AI) agents in real-world social interactions. We introduce a large-scale real-world Human Robot Social Interaction (HSRI) dataset to benchmark the capabilities of language models (LMs) and foundational models (FMs). Our dataset consists of 400 real-world human social robot interaction videos and over 10K annotations, detailing the robot's social errors, competencies, rationale, and corrective actions.
(2025-04-07T06:27:02Z)
- Social Genome: Grounded Social Reasoning Abilities of Multimodal Models [61.88413918026431]
Social reasoning abilities are crucial for AI systems to interpret and respond to multimodal human communication and interaction within social contexts. We introduce SOCIAL GENOME, the first benchmark for fine-grained, grounded social reasoning abilities of multimodal models.
(2025-02-21T00:05:40Z)