Related papers: MMToM-QA: Multimodal Theory of Mind Question Answering

MMToM-QA: Multimodal Theory of Mind Question Answering

URL: http://arxiv.org/abs/2401.08743v2
Date: Sat, 15 Jun 2024 10:13:14 GMT
Title: MMToM-QA: Multimodal Theory of Mind Question Answering
Authors: Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua B. Tenenbaum, Tianmin Shu,
Abstract summary: Theory of Mind (ToM) is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations extracted from any available data.
Score: 80.87550820953236
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Theory of Mind (ToM), the ability to understand people's mental states, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets - either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and language models.

Related papers

MOMENTS: A Comprehensive Multimodal Benchmark for Theory of Mind [28.25540132218273]
Theory of Mind is essential for building socially intelligent multimodal agents.<n>We introduce MOMENTS, a benchmark designed to assess the ToM capabilities of multimodal large language models.
arXiv Detail & Related papers (2025-07-06T15:06:30Z)
From Black Boxes to Transparent Minds: Evaluating and Enhancing the Theory of Mind in Multimodal Large Language Models [17.235722538085263]
This study adopts an approach based on internal mechanisms to provide an interpretability-driven assessment of Theory of Mind (ToM) in large language models (MLLMs)<n>We first construct a multimodal ToM test dataset, GridToM, which incorporates diverse belief testing tasks and perceptual information from multiple perspectives.<n>Next, our analysis shows that attention heads in multimodal large models can distinguish cognitive information across perspectives, providing evidence of ToM capabilities.
arXiv Detail & Related papers (2025-06-17T06:27:42Z)
Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging [103.98582374569789]
Model merging aims to combine multiple expert models into a single model, thereby reducing storage and serving costs.<n>Previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks.<n>We introduce the model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, providing both LoRA and full fine-tuning models.
arXiv Detail & Related papers (2025-05-26T12:23:14Z)
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning [76.35753243272521]
We introduce VisualPRM, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) Our model achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels.
arXiv Detail & Related papers (2025-03-13T12:03:37Z)
Decompose-ToM: Enhancing Theory of Mind Reasoning in Large Language Models through Simulation and Task Decomposition [2.089191490381739]
Theory of Mind (ToM) is the ability to understand and reflect on the mental states of others. Large Language Models (LLMs) possess only a rudimentary understanding of ToM. We propose Decompose-ToM'': an LLM-based inference algorithm that improves model performance on complex ToM tasks.
arXiv Detail & Related papers (2025-01-15T18:44:01Z)
Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench [17.73279547506514]
We introduce Multimodal Large Language Model Unlearning Benchmark (MLLMU-Bench), a novel benchmark aimed at advancing the understanding of multimodal machine unlearning. MLLMU-Bench consists of 500 fictitious profiles and 153 profiles for public celebrities, each profile feature over 14 customized question-answer pairs, evaluated from both multimodal (image+text) and unimodal (text) perspectives. Surprisingly, our experiments show that unimodal unlearning algorithms excel in generation and cloze tasks, while multimodal unlearning approaches perform better in classification tasks with multimodal inputs.
arXiv Detail & Related papers (2024-10-29T15:07:23Z)
OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to rigorously evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. Our main findings reveal that most OLMs exhibit critical limitations in instruction-following and reasoning capabilities within tri-modal contexts. To address this gap, we curate an instruction tuning dataset of 84.5K training samples, OmniInstruct, for training OLMs to adapt to multimodal contexts.
arXiv Detail & Related papers (2024-09-23T17:59:05Z)
Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond [48.43910061720815]
Multi-modal generative AI has received increasing attention in both academia and industry. One natural question arises: Is it possible to have a unified model for both understanding and generation?
arXiv Detail & Related papers (2024-09-23T13:16:09Z)
Explicit Modelling of Theory of Mind for Belief Prediction in Nonverbal Social Interactions [9.318796743761224]
We propose MToMnet - a Theory of Mind (ToM) neural network for predicting beliefs and their dynamics during human social interactions from multimodal input. MToMnet encodes contextual cues and integrates them with person-specific cues (human gaze and body language) in a separate MindNet for each person. Our results demonstrate that MToMnet surpasses existing methods by a large margin while at the same time requiring a significantly smaller number of parameters.
arXiv Detail & Related papers (2024-07-09T11:15:51Z)
Explore the Limits of Omni-modal Pretraining at Scale [21.82148059125346]
We propose a scalable pretraining paradigm, named Multimodal Context (MiCo) MiCo can scale up the numbers of modalities and amount of data, together with the model parameters, in the pretraining process. Our models establish 37 new records for state-of-the-art performance.
arXiv Detail & Related papers (2024-06-13T17:59:53Z)
Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost. Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks. Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models. Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z)
InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining [76.32065400614162]
We propose a novel model, namely InterBERT (BERT for Interaction), which is the first model of our series of multimodal pretraining methods M6. The model owns strong capability of modeling interaction between the information flows of different modalities. We propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT which is the first Chinese multi-modal pretrained model.
arXiv Detail & Related papers (2020-03-30T03:13:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.