MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models
- URL: http://arxiv.org/abs/2502.16671v1
- Date: Sun, 23 Feb 2025 18:05:49 GMT
- Title: MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models
- Authors: Hengzhi Li, Megan Tjandrasuwita, Yi R. Fung, Armando Solar-Lezama, Paul Pu Liang
- Abstract summary: We tap into data rich in nonverbal and social interactions -- mime videos. We evaluate state-of-the-art video large language models (vLLMs) and find that their overall accuracy ranges from 15% to 30%. Our analysis reveals that vLLMs often fail to ground imagined objects and over-rely on the text prompt while ignoring subtle nonverbal interactions.
- Score: 27.930709161679424
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Socially intelligent AI that can understand and interact seamlessly with humans in daily life is increasingly important as AI becomes more closely integrated with people's daily activities. However, current work in artificial social reasoning relies on language-only or language-dominant approaches to benchmarking and training models, resulting in systems that are improving in verbal communication but struggle with nonverbal social understanding. To address this limitation, we tap into a novel source of data rich in nonverbal and social interactions -- mime videos. Mime is the art of expression through gesture and movement without spoken words, which presents unique challenges and opportunities for interpreting nonverbal social communication. We contribute a new dataset called MimeQA, built by sourcing 221 videos from YouTube and applying rigorous annotation and verification, resulting in a benchmark of 101 videos and 806 question-answer pairs. Using MimeQA, we evaluate state-of-the-art video large language models (vLLMs) and find that their overall accuracy ranges from 15% to 30%. Our analysis reveals that vLLMs often fail to ground imagined objects and over-rely on the text prompt while ignoring subtle nonverbal interactions. Our data resources are released at https://github.com/MIT-MI/MimeQA to inspire future work on foundation models that embody true social intelligence and can interpret nonverbal human interactions.
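To give a concrete sense of how such a benchmark might be consumed, below is a minimal, hypothetical scoring sketch. The JSON file name, field names ("video", "question", "answer"), and normalized exact-match scoring are assumptions made for illustration only; the abstract does not specify the released format or evaluation protocol, so consult https://github.com/MIT-MI/MimeQA for the actual ones.

```python
# Minimal sketch: score a video LLM on MimeQA-style question-answer pairs.
# ASSUMPTIONS: the QA file is a JSON list of {"video", "question", "answer"} dicts
# and answers are compared by normalized exact match; neither is confirmed by the paper.
import json
from typing import Callable


def evaluate(qa_path: str, answer_fn: Callable[[str, str], str]) -> float:
    """Return overall accuracy of answer_fn(video_path, question) over the QA pairs."""
    with open(qa_path) as f:
        qa_pairs = json.load(f)  # assumed layout: list of {"video", "question", "answer"}
    correct = 0
    for item in qa_pairs:
        prediction = answer_fn(item["video"], item["question"])
        # Assumed scoring: case-insensitive exact match against the reference answer.
        if prediction.strip().lower() == item["answer"].strip().lower():
            correct += 1
    return correct / len(qa_pairs)


if __name__ == "__main__":
    # Placeholder model that always answers "yes"; a real run would query a video LLM here.
    dummy_model = lambda video, question: "yes"
    print(f"accuracy: {evaluate('mimeqa_qa_pairs.json', dummy_model):.2%}")
```

In this sketch, swapping dummy_model for a wrapper around an actual vLLM is the only change needed to reproduce an accuracy number comparable in spirit to the 15% to 30% range reported above.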
Related papers
- Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering [13.775516653315103]
Social intelligence is essential for effective communication and adaptive responses.
Current video-based methods for social intelligence rely on general video recognition or emotion recognition techniques.
We propose the Looped Video Debating framework, which integrates Large Language Models with visual information.
arXiv Detail & Related papers (2025-03-27T06:14:21Z) - Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning [31.196865401472664]
We train language models to have productive discussions about their environment in natural language without any human demonstrations. We leverage the agent's goal to predict useful information about the world as a dense reward signal that guides communication. We analyze emergent behaviors due to our technique, such as accusing suspects and providing evidence, and find that it enables strong discussions.
arXiv Detail & Related papers (2025-02-09T22:44:45Z) - The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion [46.01825432018138]
We propose a novel framework that unifies verbal and non-verbal language using multimodal language models. Our model achieves state-of-the-art performance on co-speech gesture generation. We believe unifying the verbal and non-verbal language of human motion is essential for real-world applications.
arXiv Detail & Related papers (2024-12-13T19:33:48Z) - Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation [70.52558242336988]
We focus on predicting engagement in dyadic interactions by scrutinizing verbal and non-verbal cues, aiming to detect signs of disinterest or confusion.
In this work, we collect a dataset featuring 34 participants engaged in casual dyadic conversations, each providing self-reported engagement ratings at the end of each conversation.
We introduce a novel fusion strategy using Large Language Models (LLMs) to integrate multiple behavior modalities into a "multimodal transcript".
arXiv Detail & Related papers (2024-09-13T18:28:12Z) - Nonverbal Interaction Detection [83.40522919429337]
This work addresses a new challenge of understanding human nonverbal interaction in social contexts.
We contribute a novel large-scale dataset, called NVI, which is meticulously annotated to include bounding boxes for humans and corresponding social groups.
Second, we establish a new task, NVI-DET, for nonverbal interaction detection, which is formalized as identifying triplets in the form <individual, group, interaction> from images.
Third, we propose a nonverbal interaction detection hypergraph (NVI-DEHR), a new approach that explicitly models high-order nonverbal interactions using hypergraphs.
arXiv Detail & Related papers (2024-07-11T02:14:06Z) - SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents [73.35393511272791]
We propose an interactive learning method, SOTOPIA-π, that improves the social intelligence of language agents.
This method leverages behavior cloning and self-reinforcement training on filtered social interaction data according to large language model (LLM) ratings.
arXiv Detail & Related papers (2024-03-13T17:17:48Z) - SoMeLVLM: A Large Vision Language Model for Social Media Processing [78.47310657638567]
We introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM).
SoMeLVLM is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation.
Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks.
arXiv Detail & Related papers (2024-02-20T14:02:45Z) - Think Before You Speak: Cultivating Communication Skills of Large Language Models via Inner Monologue [73.69510478736483]
Large language models (LLMs) can generate fluent, coherent, and diverse responses.
However, they lack a crucial ability: communication skills.
This article aims to empower LLMs with communication skills through inner monologues.
Experimental results show that the proposed CSIM strategy improves the backbone models and outperforms the baselines.
arXiv Detail & Related papers (2023-11-13T16:19:42Z) - Affective Faces for Goal-Driven Dyadic Communication [16.72177738101024]
We introduce a video framework for modeling the association between verbal and non-verbal communication during dyadic conversation.
Our approach retrieves a video of a listener, who has facial expressions that would be socially appropriate given the context.
arXiv Detail & Related papers (2023-01-26T05:00:09Z) - Responsive Listening Head Generation: A Benchmark Dataset and Baseline [58.168958284290156]
We define the responsive listening head generation task as the synthesis of a non-verbal head with motions and expressions reacting to multiple inputs.
Unlike speech-driven gesture or talking head generation, we introduce more modalities in this task, hoping to benefit several research fields.
arXiv Detail & Related papers (2021-12-27T07:18:50Z) - SocialAI: Benchmarking Socio-Cognitive Abilities in Deep Reinforcement Learning Agents [23.719833581321033]
Building embodied autonomous agents capable of participating in social interactions with humans is one of the main challenges in AI.
We argue that aiming towards human-level AI requires a broader set of key social skills.
We present SocialAI, a benchmark to assess the acquisition of social skills of DRL agents.
arXiv Detail & Related papers (2021-07-02T10:39:18Z) - SocialAI 0.1: Towards a Benchmark to Stimulate Research on Socio-Cognitive Abilities in Deep Reinforcement Learning Agents [23.719833581321033]
Building embodied autonomous agents capable of participating in social interactions with humans is one of the main challenges in AI.
Current approaches focus on language as a communication tool in very simplified and non-diverse social situations.
We argue that aiming towards human-level AI requires a broader set of key social skills.
arXiv Detail & Related papers (2021-04-27T14:16:29Z)