MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind
- URL: http://arxiv.org/abs/2507.04415v2
- Date: Sun, 21 Sep 2025 22:36:05 GMT
- Title: MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind
- Authors: Emilio Villa-Cueva, S M Masrur Ahmed, Rendi Chevi, Jan Christian Blaise Cruz, Kareem Elzeky, Fermin Cristobal, Alham Fikri Aji, Skyler Wang, Rada Mihalcea, Thamar Solorio
- Abstract summary: MoMentS (Multimodal Mental States) is a benchmark for building socially intelligent multimodal agents. MoMentS includes over 2,300 multiple-choice questions spanning seven distinct ToM categories. We evaluate several MLLMs and find that although vision generally improves performance, models still struggle to integrate it effectively.
- Score: 41.188841829937466
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Understanding Theory of Mind is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MoMentS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (MLLMs) through realistic, narrative-rich scenarios presented in short films. MoMentS includes over 2,300 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters' mental states. We evaluate several MLLMs and find that although vision generally improves performance, models still struggle to integrate it effectively. For audio, models that process dialogues as audio do not consistently outperform transcript-based inputs. Our findings highlight the need to improve multimodal integration and point to open challenges that must be addressed to advance AI's social understanding.
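To make the benchmark's structure and evaluation protocol concrete, here is a minimal sketch of how a MoMentS-style multiple-choice evaluation could compute per-category accuracy. The file path, JSON field names, and the `query_model` stub are illustrative assumptions, not the authors' released data format or tooling.

```python
# Minimal sketch of scoring a MoMentS-style multiple-choice ToM benchmark per category.
# The file path, JSON field names, and query_model() are illustrative assumptions,
# not the authors' released data format or API.
import json
from collections import defaultdict


def query_model(video_path: str, question: str, options: list[str]) -> str:
    """Stand-in for an MLLM call; replace with a real model that returns an option letter."""
    return "A"  # trivial first-option baseline


def evaluate(benchmark_path: str = "moments_questions.json") -> dict[str, float]:
    with open(benchmark_path) as f:
        items = json.load(f)  # assumed: list of dicts, one per multiple-choice question

    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        # assumed fields: "video", "question", "options", gold "answer" letter,
        # and "tom_category" (one of the seven ToM categories)
        pred = query_model(item["video"], item["question"], item["options"])
        cat = item["tom_category"]
        total[cat] += 1
        correct[cat] += int(pred == item["answer"])

    return {cat: correct[cat] / total[cat] for cat in total}


if __name__ == "__main__":
    for category, accuracy in sorted(evaluate().items()):
        print(f"{category}: {accuracy:.3f}")
```

Swapping the inputs passed to the model in such a loop (video frames only, video plus audio, or transcript text) is one way to set up the kind of modality comparisons the abstract describes.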
Related papers
- MMhops-R1: Multimodal Multi-hop Reasoning [89.68086555694084]
We introduce MMhops, a novel benchmark designed to evaluate and foster multi-modal multi-hop reasoning. The MMhops dataset comprises two challenging task formats, Bridging and Comparison. We propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation framework for dynamic reasoning.
arXiv Detail & Related papers (2025-12-15T17:29:02Z)
- Multi-speaker Attention Alignment for Multimodal Social Interaction [23.550501177885625]
Social interaction in video requires reasoning over a dynamic interplay of verbal and non-verbal cues. While Multimodal Large Language Models (MLLMs) are natural candidates, simply adding visual inputs yields surprisingly inconsistent gains on social tasks. We propose a multimodal multi-speaker attention alignment method that can be integrated into existing MLLMs.
arXiv Detail & Related papers (2025-11-22T07:35:47Z)
- Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions [26.074938251210842]
Despite advanced reasoning capabilities, state-of-the-art Multimodal Large Language Models (MLLMs) demonstrably lack a core component of human intelligence. We introduce a new task, Multimodal Interactive Deception Assessment (MIDA). We present a novel multimodal dataset providing synchronized video and text with verifiable ground-truth labels for every statement.
arXiv Detail & Related papers (2025-11-20T10:44:21Z)
- MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs [61.70050081221131]
MVU-Eval is the first comprehensive benchmark for evaluating Multi-Video Understanding for MLLMs. Our MVU-Eval mainly assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics.
arXiv Detail & Related papers (2025-11-10T16:02:33Z)
- MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues [38.63457491325088]
We introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring.
arXiv Detail & Related papers (2025-10-20T16:38:40Z)
- SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning [53.16179295245888]
We introduce SIV-Bench, a novel video benchmark for evaluating the capabilities of Multimodal Large Language Models (MLLMs) across Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP). SIV-Bench features 2,792 video clips and 8,792 meticulously generated question-answer pairs derived from a human-LLM collaborative pipeline. It also includes a dedicated setup for analyzing the impact of different textual cues: original on-screen text, added dialogue, or no text.
arXiv Detail & Related papers (2025-06-05T05:51:35Z)
- Understand, Think, and Answer: Advancing Visual Reasoning with Large Multimodal Models [26.14137626882127]
Large Multimodal Models (LMMs) have recently demonstrated remarkable visual understanding performance on both vision-language and vision-centric tasks. We present a unified visual reasoning mechanism that enables LMMs to solve complicated compositional problems. Our trained model, Griffon-R, is capable of end-to-end automatic understanding, self-thinking, and reasoning to produce answers.
arXiv Detail & Related papers (2025-05-27T05:50:25Z)
- Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning [32.95008932216176]
We introduce MMDiag, a multi-turn multimodal dialogue dataset. We also present DiagNote, an MLLM equipped with multimodal grounding and reasoning capabilities.
arXiv Detail & Related papers (2025-03-10T07:32:53Z)
- Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark [73.27104042215207]
We introduce EMMA, a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks.
arXiv Detail & Related papers (2025-01-09T18:55:52Z)
- HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data [55.739633494946204]
We present HumanVBench, an innovative benchmark meticulously crafted to bridge gaps in the evaluation of video MLLMs. HumanVBench comprises 16 carefully designed tasks that explore two primary dimensions: inner emotion and outer manifestations, spanning static and dynamic, basic and complex, as well as single-modal and cross-modal aspects. A comprehensive evaluation across 22 SOTA video MLLMs reveals notable limitations in current performance, especially in cross-modal and emotion perception.
arXiv Detail & Related papers (2024-12-23T13:45:56Z)
- CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models [60.08485416687596]
The Chain of Multi-modal Thought (CoMT) benchmark aims to mimic human-like reasoning that inherently integrates visual operation. We evaluate various LVLMs and strategies on CoMT, revealing some key insights into the capabilities and limitations of the current approaches.
arXiv Detail & Related papers (2024-12-17T14:10:16Z)
- Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning [53.45295657891099]
This paper proposes Visual-O1, a multi-modal multi-turn chain-of-thought reasoning framework.
It simulates human multi-modal multi-turn reasoning, providing instantial experience for highly intelligent models.
Our work highlights the potential of artificial intelligence to work like humans in real-world scenarios with uncertainty and ambiguity.
arXiv Detail & Related papers (2024-10-04T11:18:41Z)
- MuMA-ToM: Multi-modal Multi-Agent Theory of Mind [10.079620078670589]
We introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark. We provide video and text descriptions of people's multi-modal behavior in realistic household environments. We then ask questions about people's goals, beliefs, and beliefs about others' goals.
arXiv Detail & Related papers (2024-08-22T17:41:45Z)
- SoMeLVLM: A Large Vision Language Model for Social Media Processing [78.47310657638567]
We introduce a Large Vision Language Model for Social Media Processing (SoMeLVLM).
SoMeLVLM is a cognitive framework equipped with five key capabilities including knowledge & comprehension, application, analysis, evaluation, and creation.
Our experiments demonstrate that SoMeLVLM achieves state-of-the-art performance in multiple social media tasks.
arXiv Detail & Related papers (2024-02-20T14:02:45Z)
- MMToM-QA: Multimodal Theory of Mind Question Answering [80.87550820953236]
Theory of Mind (ToM) is an essential ingredient for developing machines with human-level social intelligence.
Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding.
Human ToM, on the other hand, is more than video or text understanding.
People can flexibly reason about another person's mind based on conceptual representations extracted from any available data.
arXiv Detail & Related papers (2024-01-16T18:59:24Z)
- M2Lens: Visualizing and Explaining Multimodal Models for Sentiment Analysis [28.958168542624062]
We present an interactive visual analytics system, M2Lens, to visualize and explain multimodal models for sentiment analysis.
M2Lens provides explanations on intra- and inter-modal interactions at the global, subset, and local levels.
arXiv Detail & Related papers (2021-07-17T15:54:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.