MuMA-ToM: Multi-modal Multi-Agent Theory of Mind
- URL: http://arxiv.org/abs/2408.12574v4
- Date: Thu, 23 Jan 2025 16:31:49 GMT
- Title: MuMA-ToM: Multi-modal Multi-Agent Theory of Mind
- Authors: Haojun Shi, Suyu Ye, Xinyu Fang, Chuanyang Jin, Leyla Isik, Yen-Ling Kuo, Tianmin Shu
- Abstract summary: We introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark.
We provide video and text descriptions of people's multi-modal behavior in realistic household environments.
We then ask questions about people's goals, beliefs, and beliefs about others' goals.
- Score: 10.079620078670589
- Abstract: Understanding people's social interactions in complex real-world scenarios often relies on intricate mental reasoning. To truly understand how and why people interact with one another, we must infer the underlying mental states that give rise to the social interactions, i.e., Theory of Mind reasoning in multi-agent interactions. Additionally, social interactions are often multi-modal -- we can watch people's actions, hear their conversations, and/or read about their past behaviors. For AI systems to successfully and safely interact with people in real-world environments, they also need to understand people's mental states as well as their inferences about each other's mental states based on multi-modal information about their interactions. For this, we introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark. MuMA-ToM is the first multi-modal Theory of Mind benchmark that evaluates mental reasoning in embodied multi-agent interactions. In MuMA-ToM, we provide video and text descriptions of people's multi-modal behavior in realistic household environments. Based on the context, we then ask questions about people's goals, beliefs, and beliefs about others' goals. We validated MuMA-ToM in a human experiment and provided a human baseline. We also proposed a novel multi-modal, multi-agent ToM model, LIMP (Language model-based Inverse Multi-agent Planning). Our experimental results show that LIMP significantly outperforms state-of-the-art methods, including large multi-modal models (e.g., GPT-4o, Gemini-1.5 Pro) and a recent multi-modal ToM model, BIP-ALM.
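As a rough illustration of the inverse-planning idea behind LIMP, the sketch below infers an agent's latent goal from observed actions via Bayes' rule. The goal hypotheses, actions, and likelihood values are invented for illustration; in LIMP itself, the likelihood of an action under a hypothesized mental state would be scored by a language model rather than read from a hand-coded table.

```python
# Minimal sketch of inverse planning over discrete goal hypotheses.
# All names and probabilities below are illustrative assumptions, not
# the benchmark's data or the paper's implementation.

def infer_goal(actions, goal_prior, action_likelihood):
    """Return a posterior over candidate goals given a sequence of actions."""
    posterior = dict(goal_prior)
    for action in actions:
        for goal in posterior:
            # Multiply in how consistent this action is with each goal.
            posterior[goal] *= action_likelihood.get((action, goal), 1e-6)
    z = sum(posterior.values()) or 1.0
    return {goal: p / z for goal, p in posterior.items()}

# Hypothetical household scenario.
goal_prior = {"find_snack": 0.5, "tidy_kitchen": 0.5}
action_likelihood = {
    ("walk_to_kitchen", "find_snack"): 0.5,
    ("walk_to_kitchen", "tidy_kitchen"): 0.5,
    ("open_fridge", "find_snack"): 0.8,
    ("open_fridge", "tidy_kitchen"): 0.1,
}
print(infer_goal(["walk_to_kitchen", "open_fridge"], goal_prior, action_likelihood))
```

Running this assigns most of the posterior mass to the goal that best explains the observed action sequence, which is the basic inference pattern a ToM model of this kind relies on.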
Related papers
- ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind [25.524355451378593]
ToMATO is a new ToM benchmark formulated as multiple-choice QA over conversations.
We capture both first- and second-order mental states across five categories: belief, intention, desire, emotion, and knowledge.
ToMATO consists of 5.4k questions, 753 conversations, and 15 personality trait patterns.
arXiv Detail & Related papers (2025-01-15T14:47:02Z)
- LMAgent: A Large-scale Multimodal Agents Society for Multi-user Simulation [66.52371505566815]
Large language model (LLM)-based AI agents have made significant progress, enabling them to achieve human-like intelligence.
We present LMAgent, a very large-scale, multimodal agent society built on multimodal LLMs.
In LMAgent, besides chatting with friends, the agents can autonomously browse, purchase, and review products, and even perform live-streaming e-commerce.
arXiv Detail & Related papers (2024-12-12T12:47:09Z)
- Explicit Modelling of Theory of Mind for Belief Prediction in Nonverbal Social Interactions [9.318796743761224]
We propose MToMnet - a Theory of Mind (ToM) neural network for predicting beliefs and their dynamics during human social interactions from multimodal input.
MToMnet encodes contextual cues and integrates them with person-specific cues (human gaze and body language) in a separate MindNet for each person.
Our results demonstrate that MToMnet surpasses existing methods by a large margin while requiring significantly fewer parameters (a toy sketch of this per-person MindNet design appears after the related-papers list below).
arXiv Detail & Related papers (2024-07-09T11:15:51Z)
- Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models [52.894048516550065]
We develop a pipeline for multimodal ToM reasoning using video and text.
We also enable explicit ToM reasoning by retrieving key frames for answering a ToM question.
arXiv Detail & Related papers (2024-06-19T18:24:31Z)
- MMToM-QA: Multimodal Theory of Mind Question Answering [80.87550820953236]
Theory of Mind (ToM) is an essential ingredient for developing machines with human-level social intelligence.
Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding.
Human ToM, on the other hand, is more than video or text understanding.
People can flexibly reason about another person's mind based on conceptual representations extracted from any available data.
arXiv Detail & Related papers (2024-01-16T18:59:24Z)
- SpeechAgents: Human-Communication Simulation with Multi-Modal Multi-Agent Systems [53.94772445896213]
Large Language Model (LLM)-based multi-agent systems have demonstrated promising performance in simulating human society.
We propose SpeechAgents, a multi-modal LLM-based multi-agent system designed for simulating human communication.
arXiv Detail & Related papers (2024-01-08T15:01:08Z)
- On the Linguistic and Computational Requirements for Creating Face-to-Face Multimodal Human-Machine Interaction [0.0]
We video-recorded thirty-four human-avatar interactions, performed complete linguistic microanalysis on video excerpts, and marked all the occurrences of multimodal actions and events.
The data show evidence that double-loop feedback is established during a face-to-face conversation.
We propose that knowledge from Conversation Analysis (CA), cognitive science, and Theory of Mind (ToM), among others, should be incorporated into the frameworks used to describe human-machine multimodal interactions.
arXiv Detail & Related papers (2022-11-24T21:17:36Z)
- Learning Triadic Belief Dynamics in Nonverbal Communication from Videos [81.42305032083716]
Nonverbal communication can convey rich social information among agents.
In this paper, we incorporate different nonverbal communication cues to represent, model, learn, and infer agents' mental states.
arXiv Detail & Related papers (2021-04-07T00:52:04Z)
- SMARTS: Scalable Multi-Agent Reinforcement Learning Training School for Autonomous Driving [96.50297622371457]
Multi-agent interaction is a fundamental aspect of autonomous driving in the real world.
Despite more than a decade of research and development, the problem of how to interact with diverse road users in diverse scenarios remains largely unsolved.
We develop a dedicated simulation platform called SMARTS that generates diverse and competent driving interactions.
arXiv Detail & Related papers (2020-10-19T18:26:10Z)
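As promised in the MToMnet entry above, here is a toy sketch of the per-person "MindNet" design it describes: a shared context encoding is fused with each person's own cues (e.g. gaze, body pose), and a separate small network per person predicts that person's belief state. The layer sizes, the concatenation-plus-MLP fusion, and the PyTorch framing are illustrative assumptions, not the paper's exact architecture.

```python
# Toy sketch: one small "MindNet" per person, each fusing shared scene
# context with that person's own cues to predict a belief distribution.
# Dimensions and fusion scheme are assumptions made for illustration.
import torch
import torch.nn as nn

class MindNet(nn.Module):
    def __init__(self, context_dim, cue_dim, num_belief_states):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(context_dim + cue_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_belief_states),
        )

    def forward(self, context, person_cues):
        # Concatenate shared context with person-specific cues, then score beliefs.
        return self.fuse(torch.cat([context, person_cues], dim=-1))

# Shared scene/context features plus per-person cue features (random placeholders).
context = torch.randn(1, 32)
cues = {"alice": torch.randn(1, 16), "bob": torch.randn(1, 16)}
nets = {name: MindNet(32, 16, num_belief_states=4) for name in cues}
beliefs = {name: nets[name](context, cue).softmax(-1) for name, cue in cues.items()}
print({name: b.shape for name, b in beliefs.items()})
```

Keeping a separate network per person is what lets each agent's belief estimate diverge from the others' even though they all observe the same scene, which is the core requirement for modeling differing beliefs in multi-agent ToM.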