Benchmarking Sequential Visual Input Reasoning and Prediction in
Multimodal Large Language Models
- URL: http://arxiv.org/abs/2310.13473v1
- Date: Fri, 20 Oct 2023 13:14:38 GMT
- Title: Benchmarking Sequential Visual Input Reasoning and Prediction in
Multimodal Large Language Models
- Authors: Mingwei Zhu, Leigang Sha, Yu Shu, Kangjia Zhao, Tiancheng Zhao,
Jianwei Yin
- Abstract summary: We introduce a novel benchmark that assesses the predictive reasoning capabilities of MLLMs across diverse scenarios.
Our benchmark targets three important domains: abstract pattern reasoning, human activity prediction, and physical interaction prediction.
Empirical experiments confirm the soundness of the proposed benchmark and evaluation methods.
- Score: 21.438427686724932
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal large language models (MLLMs) have shown great potential in
perception and interpretation tasks, but their capabilities in predictive
reasoning remain under-explored. To address this gap, we introduce a novel
benchmark that assesses the predictive reasoning capabilities of MLLMs across
diverse scenarios. Our benchmark targets three important domains: abstract
pattern reasoning, human activity prediction, and physical interaction
prediction. We further develop three evaluation methods powered by large
language models to robustly quantify a model's performance in predicting and
reasoning about the future based on multi-visual contexts. Empirical experiments
confirm the soundness of the proposed benchmark and evaluation methods through
rigorous testing and reveal the strengths and weaknesses of current popular
MLLMs on the task of predictive reasoning. Lastly, our proposed benchmark
provides a standardized evaluation framework for MLLMs and can facilitate the
development of more advanced models that can reason and predict over complex,
long sequences of multimodal input.
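The abstract does not spell out how the LLM-powered evaluation methods score a prediction, so the following is only a minimal sketch of one plausible setup: an LLM judge rates a model's predicted continuation against the ground-truth outcome for a sequence of visual inputs. The sample fields, the judge prompt, the 1-5 scale, and the injected `call_llm_judge` function are illustrative assumptions, not the paper's actual protocol.

```python
# Minimal sketch of an LLM-powered evaluation loop for predictive reasoning.
# The prompt, scoring scale, and judge interface are illustrative assumptions,
# not the benchmark's actual evaluation protocol.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PredictiveReasoningSample:
    frame_descriptions: List[str]   # textual stand-ins for the sequential visual context
    ground_truth_outcome: str       # what actually happens next
    model_prediction: str           # the MLLM's predicted continuation


JUDGE_PROMPT = """You are grading a model's prediction of what happens next.
Visual context (in order): {context}
Ground-truth outcome: {truth}
Model prediction: {prediction}
Rate the prediction's correctness from 1 (wrong) to 5 (fully correct).
Answer with a single integer."""


def score_sample(sample: PredictiveReasoningSample,
                 call_llm_judge: Callable[[str], str]) -> int:
    """Ask an LLM judge to rate one prediction; returns an integer 1-5 score."""
    prompt = JUDGE_PROMPT.format(
        context=" -> ".join(sample.frame_descriptions),
        truth=sample.ground_truth_outcome,
        prediction=sample.model_prediction,
    )
    reply = call_llm_judge(prompt)
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 1  # fall back to the lowest score on parse failure


def benchmark_score(samples: List[PredictiveReasoningSample],
                    call_llm_judge: Callable[[str], str]) -> float:
    """Average judge score over the benchmark, normalized to [0, 1]."""
    scores = [score_sample(s, call_llm_judge) for s in samples]
    return (sum(scores) / len(scores) - 1) / 4
```

Any chat-style model can serve as the judge here; `call_llm_judge` is simply a function that takes the prompt string and returns the model's text reply, so the sketch stays independent of a particular API.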
Related papers
- Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.
We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.
Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings.
arXiv Detail & Related papers (2024-10-24T17:56:08Z) - Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debates [0.0]
We propose a framework that interprets large language models (LLMs) as advocates within an ensemble of interacting agents.
This approach offers a more dynamic and comprehensive evaluation process compared to traditional human-based assessments or automated metrics.
arXiv Detail & Related papers (2024-10-07T00:22:07Z) - A Survey on Multimodal Benchmarks: In the Era of Large AI Models [13.299775710527962]
Multimodal Large Language Models (MLLMs) have brought substantial advancements in artificial intelligence.
This survey systematically reviews 211 benchmarks that assess MLLMs across four core domains: understanding, reasoning, generation, and application.
arXiv Detail & Related papers (2024-09-21T15:22:26Z) - N-gram Prediction and Word Difference Representations for Language Modeling [0.0]
We introduce a simple N-gram prediction framework for the Causal Language Model (CLM) task.
We also introduce word difference representation (WDR) as a surrogate and contextualized target representation during model training.
To further enhance the quality of next word prediction, we propose an ensemble method that incorporates the future N words' prediction results.
arXiv Detail & Related papers (2024-09-05T07:03:23Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.
We present MR-Ben, a process-based benchmark that demands meta-reasoning skill.
Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models [58.58594658683919]
Large multimodal models (LMMs) have shown transformative potential across various research tasks.
Our findings indicate LMMs possess advantages in zero-shot learning, interpretability, and handling uncurated 'in-the-wild' inputs.
We propose a Chain-of-Thought augmented prompting approach, which effectively mitigates the off-target prediction issue.
arXiv Detail & Related papers (2024-05-24T16:26:56Z) - Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models [24.445829787297658]
Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications.
This study aims to scrutinize the validity of such probability-based evaluation methods within the context of using LLMs for Multiple Choice Questions (MCQs).
Our empirical investigation reveals that the prevalent probability-based evaluation method inadequately aligns with generation-based prediction.
arXiv Detail & Related papers (2024-02-21T15:58:37Z) - Evaluating and Explaining Large Language Models for Code Using Syntactic
Structures [74.93762031957883]
This paper introduces ASTxplainer, an explainability method specific to Large Language Models for code.
At its core, ASTxplainer provides an automated method for aligning token predictions with AST nodes.
We perform an empirical evaluation on 12 popular LLMs for code using a curated dataset of the most popular GitHub projects.
arXiv Detail & Related papers (2023-08-07T18:50:57Z) - MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z) - ThinkSum: Probabilistic reasoning over sets using large language models [18.123895485602244]
We propose a two-stage probabilistic inference paradigm, ThinkSum, which reasons over sets of objects or facts in a structured manner.
We demonstrate the possibilities and advantages of ThinkSum on the BIG-bench suite of LLM evaluation tasks.
arXiv Detail & Related papers (2022-10-04T00:34:01Z) - Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences.
In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning.
arXiv Detail & Related papers (2022-05-28T01:03:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.