MMToM-QA: Multimodal Theory of Mind Question Answering
- URL: http://arxiv.org/abs/2401.08743v2
- Date: Sat, 15 Jun 2024 10:13:14 GMT
- Title: MMToM-QA: Multimodal Theory of Mind Question Answering
- Authors: Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua B. Tenenbaum, Tianmin Shu,
- Abstract summary: Theory of Mind (ToM) is an essential ingredient for developing machines with human-level social intelligence.
Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding.
Human ToM, on the other hand, is more than video or text understanding.
People can flexibly reason about another person's mind based on conceptual representations extracted from any available data.
- Score: 80.87550820953236
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Theory of Mind (ToM), the ability to understand people's mental states, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets - either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person's mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person's activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and language models.
Related papers
- Explicit Modelling of Theory of Mind for Belief Prediction in Nonverbal Social Interactions [9.318796743761224]
We propose MToMnet - a Theory of Mind (ToM) neural network for predicting beliefs and their dynamics during human social interactions from multimodal input.
MToMnet encodes contextual cues and integrates them with person-specific cues (human gaze and body language) in a separate MindNet for each person.
Our results demonstrate that MToMnet surpasses existing methods by a large margin while at the same time requiring a significantly smaller number of parameters.
arXiv Detail & Related papers (2024-07-09T11:15:51Z) - Explore the Limits of Omni-modal Pretraining at Scale [21.82148059125346]
We propose a scalable pretraining paradigm, named Multimodal Context (MiCo)
MiCo can scale up the numbers of modalities and amount of data, together with the model parameters, in the pretraining process.
Our models establish 37 new records for state-of-the-art performance.
arXiv Detail & Related papers (2024-06-13T17:59:53Z) - Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual words, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z) - Multimodality and Attention Increase Alignment in Natural Language
Prediction Between Humans and Computational Models [0.8139163264824348]
Humans are known to use salient multimodal features, such as visual cues, to facilitate the processing of upcoming words.
multimodal computational models can integrate visual and linguistic data using a visual attention mechanism to assign next-word probabilities.
We show that predictability estimates from humans aligned more closely with scores generated from multimodal models vs. their unimodal counterparts.
arXiv Detail & Related papers (2023-08-11T09:30:07Z) - Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z) - PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z) - DIME: Fine-grained Interpretations of Multimodal Models via Disentangled
Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models.
Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z) - InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining [76.32065400614162]
We propose a novel model, namely InterBERT (BERT for Interaction), which is the first model of our series of multimodal pretraining methods M6.
The model owns strong capability of modeling interaction between the information flows of different modalities.
We propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT which is the first Chinese multi-modal pretrained model.
arXiv Detail & Related papers (2020-03-30T03:13:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.