SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
- URL: http://arxiv.org/abs/2506.05425v1
- Date: Thu, 05 Jun 2025 05:51:35 GMT
- Title: SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning
- Authors: Fanqi Kong, Weiqin Zu, Xinyu Chen, Yaodong Yang, Song-Chun Zhu, Xue Feng
- Abstract summary: We introduce SIV-Bench, a novel video benchmark for evaluating the capabilities of Multimodal Large Language Models (MLLMs) across Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP). SIV-Bench features 2,792 video clips and 8,792 meticulously generated question-answer pairs derived from a human-LLM collaborative pipeline. It also includes a dedicated setup for analyzing the impact of different textual cues: original on-screen text, added dialogue, or no text.
- Score: 53.16179295245888
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rich and multifaceted nature of human social interaction, encompassing multimodal cues, unobservable relations and mental states, and dynamical behavior, presents a formidable challenge for artificial intelligence. To advance research in this area, we introduce SIV-Bench, a novel video benchmark for rigorously evaluating the capabilities of Multimodal Large Language Models (MLLMs) across Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP). SIV-Bench features 2,792 video clips and 8,792 meticulously generated question-answer pairs derived from a human-LLM collaborative pipeline. The videos are collected from TikTok and YouTube and cover a wide range of genres, presentation styles, and linguistic and cultural backgrounds. SIV-Bench also includes a dedicated setup for analyzing the impact of different textual cues: original on-screen text, added dialogue, or no text. Our comprehensive experiments on leading MLLMs reveal that while models adeptly handle SSU, they significantly struggle with SSR and SDP, where Relation Inference (RI) is an acute bottleneck, as further examined in our analysis. Our study also confirms the critical role of transcribed dialogue in aiding comprehension of complex social interactions. By systematically identifying current MLLMs' strengths and limitations, SIV-Bench offers crucial insights to steer the development of more socially intelligent AI. The dataset and code are available at https://kfq20.github.io/sivbench/.
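The abstract frames SIV-Bench as a question-answer benchmark organized along two axes: task type (SSU, SSR, SDP) and text-cue condition (on-screen text, added dialogue, or no text). As a rough illustration only, and not the released data format or official evaluation code, a per-task scoring loop over such items might look like the sketch below; the `BenchmarkItem` fields, the condition names, and the `model` callable are all assumptions made for this example.

```python
# Hypothetical sketch (not the official SIV-Bench loader or evaluator).
from dataclasses import dataclass
from typing import Callable, Dict, List

TASKS = ("SSU", "SSR", "SDP")  # Scene Understanding / State Reasoning / Dynamics Prediction
TEXT_CONDITIONS = ("on_screen_text", "added_dialogue", "no_text")  # assumed labels

@dataclass
class BenchmarkItem:
    video_path: str        # path to the video clip
    task: str              # one of TASKS
    text_condition: str    # one of TEXT_CONDITIONS
    question: str
    options: List[str]     # candidate answers
    answer: str            # ground-truth option

def evaluate(items: List[BenchmarkItem],
             model: Callable[[BenchmarkItem], str]) -> Dict[str, float]:
    """Compute per-task accuracy for a model that maps an item to a predicted option."""
    correct = {t: 0 for t in TASKS}
    total = {t: 0 for t in TASKS}
    for item in items:
        total[item.task] += 1
        if model(item) == item.answer:
            correct[item.task] += 1
    return {t: correct[t] / total[t] for t in TASKS if total[t]}
```

The same loop could be run once per text-cue condition to reproduce the kind of comparison the paper describes (e.g., with versus without transcribed dialogue), under the assumptions stated above.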
Related papers
- VEGAS: Towards Visually Explainable and Grounded Artificial Social Intelligence [22.086567828557683]
Social Intelligence Queries (Social-IQ) serves as the primary multimodal benchmark for evaluating a model's social intelligence level. We propose the Visually Explainable and Grounded Artificial Social Intelligence (VEGAS) model.
arXiv Detail & Related papers (2025-04-03T02:48:21Z)
- Towards Online Multi-Modal Social Interaction Understanding [36.37278022436327]
We propose an online MMSI setting, where the model must resolve MMSI tasks using only historical information, such as recorded dialogues and video streams. We develop a novel framework, named Online-MMSI-VLM, that leverages two complementary strategies: multi-party conversation forecasting and social-aware visual prompting. Our method achieves state-of-the-art performance and significantly outperforms baseline models, indicating its effectiveness on Online-MMSI.
arXiv Detail & Related papers (2025-03-25T17:17:19Z)
- Towards Anthropomorphic Conversational AI Part I: A Practical Framework [49.62013440962072]
We introduce a multi-module framework designed to replicate the key aspects of human intelligence involved in conversations. In the second stage of our approach, these conversational data, after filtering and labeling, can serve as training and testing data for reinforcement learning.
arXiv Detail & Related papers (2025-02-28T03:18:39Z)
- HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs. HumanVBench comprises 16 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects. A comprehensive evaluation across 22 SOTA video MLLMs reveals notable limitations in current performance, especially in cross-modal and emotion perception.
arXiv Detail & Related papers (2024-12-23T13:45:56Z)
- Engagement-Driven Content Generation with Large Language Models [8.049552839071918]
Large Language Models (LLMs) demonstrate significant persuasive capabilities in one-on-one interactions. Their influence within social networks, where interconnected users and complex opinion dynamics pose unique challenges, remains underexplored. This paper addresses the research question: Can LLMs generate meaningful content that maximizes user engagement on social networks?
arXiv Detail & Related papers (2024-11-20T10:40:08Z)
- How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES).
CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions.
Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z)
- MM-Soc: Benchmarking Multimodal Large Language Models in Social Media Platforms [25.73585435351771]
This paper introduces MM-Soc, a benchmark designed to evaluate Multimodal Large Language Models' understanding of social media content.
MM-Soc compiles prominent multimodal datasets and incorporates a novel large-scale YouTube tagging dataset.
Our analysis reveals that, in a zero-shot setting, various types of MLLMs generally exhibit difficulties in handling social media tasks.
arXiv Detail & Related papers (2024-02-21T22:27:40Z)
- SoMeLVLM: A Large Vision Language Model for Social Media Processing [78.47310657638567]
We introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM).
SoMeLVLM is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation.
Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks.
arXiv Detail & Related papers (2024-02-20T14:02:45Z)
- Face-to-Face Contrastive Learning for Social Intelligence Question-Answering [55.90243361923828]
Multimodal methods have set the state of the art on many tasks but have difficulty modeling complex face-to-face conversational dynamics.
We propose Face-to-Face Contrastive Learning (F2F-CL), a graph neural network designed to model social interactions.
We experimentally evaluate on the challenging Social-IQ dataset and show state-of-the-art results.
arXiv Detail & Related papers (2022-07-29T20:39:44Z)