Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models
- URL: http://arxiv.org/abs/2406.13763v1
- Date: Wed, 19 Jun 2024 18:24:31 GMT
- Title: Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models
- Authors: Zhawnen Chen, Tianchun Wang, Yizhou Wang, Michal Kosinski, Xiang Zhang, Yun Fu, Sheng Li
- Abstract summary: We develop a pipeline for multimodal ToM reasoning using video and text.
We also enable explicit ToM reasoning by retrieving key frames for answering a ToM question.
- Score: 52.894048516550065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Can large multimodal models have a human-like ability for emotional and social reasoning, and if so, how does it work? Recent research has discovered emergent theory-of-mind (ToM) reasoning capabilities in large language models (LLMs). LLMs can reason about people's mental states by solving various text-based ToM tasks that ask questions about the actors' ToM (e.g., human belief, desire, intention). However, human reasoning in the wild is often grounded in dynamic scenes across time. Thus, we consider videos a new medium for examining spatio-temporal ToM reasoning ability. Specifically, we ask explicit probing questions about videos with abundant social and emotional reasoning content. We develop a pipeline for multimodal LLM ToM reasoning using video and text. We also enable explicit ToM reasoning by retrieving key frames for answering a ToM question, which reveals how multimodal LLMs reason about ToM.
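The key-frame retrieval step can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes frame and question embeddings are precomputed by some vision-language encoder, and the function and variable names are hypothetical. The idea is simply to score each frame against the question and keep the top-k most relevant frames as evidence for the ToM answer.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve_key_frames(frame_embeddings, question_embedding, k=3):
    """Return indices of the k frames most similar to the ToM question.

    frame_embeddings: list of per-frame embedding vectors (assumed
    precomputed); question_embedding: embedding of the probing question.
    """
    ranked = sorted(
        range(len(frame_embeddings)),
        key=lambda i: cosine(frame_embeddings[i], question_embedding),
        reverse=True,
    )
    return ranked[:k]
```

The retrieved frames would then be passed, together with the question text, to the multimodal LLM; the selection itself also serves as an explanation of which moments the model grounds its ToM answer in.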
Related papers
- MuMA-ToM: Multi-modal Multi-Agent Theory of Mind [10.079620078670589]
We introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark.
We provide video and text descriptions of people's multi-modal behavior in realistic household environments.
We then ask questions about people's goals, beliefs, and beliefs about others' goals.
arXiv Detail & Related papers (2024-08-22T17:41:45Z) - TimeToM: Temporal Space is the Key to Unlocking the Door of Large Language Models' Theory-of-Mind [18.258876784733733]
We present TimeToM, which constructs a temporal space and uses it as the foundation to improve the ToM capabilities of Large Language Models.
We design a novel tool-belief solver that, by considering belief communication between characters in temporal space, can transform a character's higher-order beliefs into another character's first-order beliefs.
arXiv Detail & Related papers (2024-07-01T16:50:49Z) - Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses [11.121931601655174]
Theory of Mind (ToM) reasoning entails recognizing that other individuals possess their own intentions, emotions, and thoughts.
Large language models (LLMs) excel in tasks such as summarization, question answering, and translation.
Despite advancements, the extent to which LLMs truly understand ToM reasoning remains inadequately explored in open-ended scenarios.
arXiv Detail & Related papers (2024-06-09T05:57:59Z) - Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models [71.93366651585275]
Large language models (LLMs) have exhibited impressive performance in language comprehension and various reasoning tasks.
We propose Visualization-of-Thought (VoT) to elicit spatial reasoning of LLMs by visualizing their reasoning traces.
VoT significantly enhances the spatial reasoning abilities of LLMs.
arXiv Detail & Related papers (2024-04-04T17:45:08Z) - MMToM-QA: Multimodal Theory of Mind Question Answering [80.87550820953236]
Theory of Mind (ToM) is an essential ingredient for developing machines with human-level social intelligence.
Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding.
Human ToM, on the other hand, is more than video or text understanding.
People can flexibly reason about another person's mind based on conceptual representations extracted from any available data.
arXiv Detail & Related papers (2024-01-16T18:59:24Z) - Think Twice: Perspective-Taking Improves Large Language Models' Theory-of-Mind Capabilities [63.90227161974381]
SimToM is a novel prompting framework inspired by Simulation Theory's notion of perspective-taking.
Our approach, which requires no additional training and minimal prompt-tuning, shows substantial improvement over existing methods.
arXiv Detail & Related papers (2023-11-16T22:49:27Z) - HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models [31.831042765744204]
Theory of Mind (ToM) is the ability to reason about one's own and others' mental states.
We introduce HI-TOM, a Higher Order Theory of Mind benchmark.
Our experimental evaluation using various Large Language Models (LLMs) indicates a decline in performance on higher-order ToM tasks.
arXiv Detail & Related papers (2023-10-25T16:41:15Z) - FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions [94.61530480991627]
Theory of mind evaluations currently focus on testing models using passive narratives that inherently lack interactivity.
We introduce FANToM, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question answering.
arXiv Detail & Related papers (2023-10-24T00:24:11Z) - Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models [107.07851578154242]
Language models (LMs) have strong multi-step (i.e., procedural) reasoning capabilities.
It is unclear whether LMs perform these tasks by cheating with answers memorized from the pretraining corpus, or via a genuine multi-step reasoning mechanism.
We show that MechanisticProbe is able to detect the information of the reasoning tree from the model's attentions for most examples.
arXiv Detail & Related papers (2023-10-23T01:47:29Z) - Are LLMs the Master of All Trades? : Exploring Domain-Agnostic Reasoning Skills of LLMs [0.0]
This study aims to investigate the performance of large language models (LLMs) on different reasoning tasks.
My findings indicate that LLMs excel at analogical and moral reasoning, yet struggle to perform as proficiently on spatial reasoning tasks.
arXiv Detail & Related papers (2023-03-22T22:53:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.