Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner
- URL: http://arxiv.org/abs/2506.01301v1
- Date: Mon, 02 Jun 2025 04:23:45 GMT
- Title: Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner
- Authors: Chunhui Zhang, Zhongyu Ouyang, Kwonjoon Lee, Nakul Agarwal, Sean Dae Houlihan, Soroush Vosoughi, Shao-Yuan Lo
- Abstract summary: We propose a scalable Bayesian ToM planner that decomposes ToM reasoning into stepwise Bayesian updates. Our framework introduces weak-to-strong control, allowing smaller language models to specialize in ToM-specific likelihood estimation. Our method achieves a 4.6% accuracy improvement over state-of-the-art techniques on multimodal ToM benchmarks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Theory-of-Mind (ToM) enables humans to infer mental states, such as beliefs, desires, and intentions, forming the foundation of social cognition. However, existing computational ToM methods rely on structured workflows with ToM-specific priors or deep model fine-tuning, which struggle with scalability in multimodal environments and fail to generalize as task complexity increases. To address these limitations, we propose a scalable Bayesian ToM planner that decomposes ToM reasoning into stepwise Bayesian updates. Our framework introduces weak-to-strong control, allowing smaller language models (LMs) to specialize in ToM-specific likelihood estimation and transfer their reasoning behaviors to larger LMs (7B to 405B) for integration with social and world knowledge. This synergistic approach aligns large-model inference of human mental states with Bayesian principles. Extensive experiments show that our method achieves a 4.6% accuracy improvement over state-of-the-art techniques on multimodal ToM benchmarks, including challenging unseen scenarios, thereby establishing a new standard for modeling human mental states in complex environments.
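The recursion the abstract describes can be written down directly: at each step t, the belief over candidate mental states m is updated as b_t(m) ∝ P_small(a_t | m) · b_{t-1}(m), with the prior b_0 supplied by the large model. Below is a minimal runnable sketch of that loop under assumed interfaces; `small_lm_likelihood` and `large_lm_prior` are hypothetical stand-ins (toy stubs here), not the authors' implementation.

```python
# Minimal sketch of stepwise Bayesian ToM updates with weak-to-strong
# control. NOT the authors' implementation: `small_lm_likelihood` and
# `large_lm_prior` are hypothetical stand-ins (toy stubs) for a small
# ToM-specialized LM and a large general-knowledge LM.

def small_lm_likelihood(action: str, hypothesis: str) -> float:
    """Stand-in for the small LM's P(action | mental-state hypothesis).
    Toy heuristic: word overlap between action and hypothesis."""
    overlap = len(set(action.lower().split()) & set(hypothesis.lower().split()))
    return 0.1 + overlap  # keep strictly positive

def large_lm_prior(hypotheses: list[str], context: str) -> list[float]:
    """Stand-in for the large LM's knowledge-informed prior; uniform here."""
    return [1.0 / len(hypotheses)] * len(hypotheses)

def bayesian_tom_posterior(hypotheses: list[str], actions: list[str],
                           context: str) -> list[float]:
    # b_0: prior over candidate mental states from the large model.
    belief = large_lm_prior(hypotheses, context)
    for action in actions:
        # Stepwise update: b_t(m) ∝ P_small(a_t | m) * b_{t-1}(m).
        belief = [b * small_lm_likelihood(action, h)
                  for b, h in zip(belief, hypotheses)]
        z = sum(belief)
        belief = [b / z for b in belief]  # renormalize after every step
    return belief

if __name__ == "__main__":
    hyps = ["wants the apple", "wants to leave the kitchen"]
    acts = ["walks to the fridge", "reaches for the apple"]
    print(dict(zip(hyps, bayesian_tom_posterior(hyps, acts, "kitchen scene"))))
```

The weak-to-strong split shows up in where the two models sit: the small specialized model only ever scores per-step likelihoods, while the large model contributes the prior from social and world knowledge.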
Related papers
- Towards Machine Theory of Mind with Large Language Model-Augmented Inverse Planning
We propose a hybrid approach to machine Theory of Mind (ToM) that uses large language models (LLMs) as a mechanism for generating hypotheses and likelihood functions (a minimal sketch of this pattern appears after this list). We also exhibit the model's potential to predict mental states on open-ended tasks.
arXiv Detail & Related papers (2025-07-04T16:01:27Z)
- UniToMBench: Integrating Perspective-Taking to Improve Theory of Mind in LLMs
Theory of Mind (ToM) remains a challenging area for large language models (LLMs). In this paper, we introduce UniToMBench, a unified benchmark that integrates the strengths of SimToM and TOMBENCH.
arXiv Detail & Related papers (2025-06-11T06:55:40Z)
- Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities.
arXiv Detail & Related papers (2025-05-08T03:35:23Z)
- Can Atomic Step Decomposition Enhance the Self-structured Reasoning of Multimodal Large Models?
We propose a paradigm of Self-structured Chain of Thought (SCoT), which is composed of minimal semantic atomic steps. Our method can not only generate cognitive CoT structures for various complex tasks but also mitigate the phenomenon of overthinking. We conduct extensive experiments to show that the proposed AtomThink significantly improves the performance of baseline MLLMs.
arXiv Detail & Related papers (2025-03-08T15:23:47Z)
- Decompose-ToM: Enhancing Theory of Mind Reasoning in Large Language Models through Simulation and Task Decomposition
Theory of Mind (ToM) is the ability to understand and reflect on the mental states of others. Large Language Models (LLMs) possess only a rudimentary understanding of ToM. We propose "Decompose-ToM", an LLM-based inference algorithm that improves model performance on complex ToM tasks.
arXiv Detail & Related papers (2025-01-15T18:44:01Z)
- Towards Modality Generalization: A Benchmark and Prospective Analysis
This paper introduces Modality Generalization (MG), which focuses on enabling models to generalize to unseen modalities. We propose a comprehensive benchmark featuring multi-modal algorithms and adapt existing methods that focus on generalization. Our work provides a foundation for advancing robust and adaptable multi-modal models, enabling them to handle unseen modalities in realistic scenarios.
arXiv Detail & Related papers (2024-12-24T08:38:35Z)
- Progressive Multimodal Reasoning via Active Retrieval
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs). We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs. We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z)
- A Notion of Complexity for Theory of Mind via Discrete World Models
Theory of Mind (ToM) can be used to assess the capabilities of Large Language Models (LLMs) in complex scenarios where social reasoning is required. This work proposes a framework inspired by cognitive load theory to measure the complexity of ToM tasks.
arXiv Detail & Related papers (2024-06-16T16:46:55Z)
- What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models
We propose Counterfactual Inception, a novel method that implants counterfactual thinking into Large Multi-modal Models. We aim for the models to engage with and generate responses that span a wider contextual scene understanding. Comprehensive analyses across various LMMs, including both open-source and proprietary models, corroborate that counterfactual thinking significantly reduces hallucination.
arXiv Detail & Related papers (2024-03-20T11:27:20Z)
- MMToM-QA: Multimodal Theory of Mind Question Answering
Theory of Mind (ToM) is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. Human ToM, on the other hand, is more than video or text understanding: people can flexibly reason about another person's mind based on conceptual representations extracted from any available data.
arXiv Detail & Related papers (2024-01-16T18:59:24Z)
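As flagged in the first entry above, that paper's stated mechanism, using an LLM both to generate candidate hypotheses and to serve as the likelihood function for inverse planning, fits a simple pattern. Here is a minimal illustrative sketch under assumed interfaces; `llm_generate_hypotheses` and `llm_score_likelihood` are hypothetical toy stubs, not that paper's code.

```python
# Illustrative sketch of LLM-augmented inverse planning (hypothetical API,
# not the cited paper's code). The LLM plays two roles: proposing candidate
# mental-state hypotheses for an open-ended scenario, and acting as the
# likelihood function inside an otherwise ordinary Bayesian inversion.

def llm_generate_hypotheses(scenario: str, k: int = 3) -> list[str]:
    """Stand-in for prompting an LLM for k candidate goals; canned here."""
    return ["get coffee", "find a coworker", "leave the building"][:k]

def llm_score_likelihood(behavior: str, hypothesis: str) -> float:
    """Stand-in for an LLM judging P(behavior | hypothesis); toy overlap."""
    shared = set(behavior.lower().split()) & set(hypothesis.lower().split())
    return 0.1 + len(shared)

def infer_mental_state(scenario: str, behavior: str) -> dict[str, float]:
    hypotheses = llm_generate_hypotheses(scenario)
    # Uniform prior over the generated hypotheses; invert via Bayes' rule.
    scores = [llm_score_likelihood(behavior, h) for h in hypotheses]
    z = sum(scores)
    return {h: s / z for h, s in zip(hypotheses, scores)}

print(infer_mental_state("office kitchen at 9am", "walks to the coffee machine"))
```

The design point is the division of labor: the LLM supplies the open-ended hypothesis space and behavioral likelihoods, while the inference itself stays a transparent Bayesian computation.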