Benchmarking Sequential Visual Input Reasoning and Prediction in
Multimodal Large Language Models
- URL: http://arxiv.org/abs/2310.13473v1
- Date: Fri, 20 Oct 2023 13:14:38 GMT
- Title: Benchmarking Sequential Visual Input Reasoning and Prediction in
Multimodal Large Language Models
- Authors: Mingwei Zhu, Leigang Sha, Yu Shu, Kangjia Zhao, Tiancheng Zhao,
Jianwei Yin
- Abstract summary: We introduce a novel benchmark that assesses the predictive reasoning capabilities of MLLMs across diverse scenarios.
Our benchmark targets three important domains: abstract pattern reasoning, human activity prediction, and physical interaction prediction.
Empirical experiments confirm the soundness of the proposed benchmark and evaluation methods.
- Score: 21.438427686724932
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal large language models (MLLMs) have shown great potential in
perception and interpretation tasks, but their capabilities in predictive
reasoning remain under-explored. To address this gap, we introduce a novel
benchmark that assesses the predictive reasoning capabilities of MLLMs across
diverse scenarios. Our benchmark targets three important domains: abstract
pattern reasoning, human activity prediction, and physical interaction
prediction. We further develop three evaluation methods, each powered by a large
language model, to robustly quantify a model's performance in predicting and
reasoning about the future based on multi-visual context. Empirical experiments
confirm the soundness of the proposed benchmark and evaluation methods via
rigorous testing, and reveal the strengths and weaknesses of current popular MLLMs in the task
of predictive reasoning. Lastly, our proposed benchmark provides a standardized
evaluation framework for MLLMs and can facilitate the development of more
advanced models that can reason and predict over complex, long sequences of
multimodal input.
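As a rough illustration (not the authors' released code) of how an LLM-powered
evaluation method for predictive reasoning might be structured, the Python sketch
below feeds an ordered image sequence to a model under test and scores its
predicted continuation against a reference answer with an LLM judge. The sample
fields and the mllm_predict / llm_judge callables are hypothetical placeholders
supplied by the caller; the paper's actual prompts and scoring rubric are not
reproduced here.
```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PredictiveSample:
    """One benchmark item: an ordered visual context plus a reference future."""
    frame_paths: List[str]  # ordered input images (the multi-visual context)
    question: str           # e.g. "What is most likely to happen next?"
    reference: str          # ground-truth description of the future event

def evaluate_predictive_reasoning(
    samples: List[PredictiveSample],
    mllm_predict: Callable[[List[str], str], str],  # (frames, question) -> predicted future
    llm_judge: Callable[[str, str], float],         # (prediction, reference) -> score in [0, 1]
) -> float:
    """Return the mean LLM-judge score over all benchmark items."""
    scores = []
    for sample in samples:
        prediction = mllm_predict(sample.frame_paths, sample.question)
        scores.append(llm_judge(prediction, sample.reference))
    return sum(scores) / len(scores) if scores else 0.0
```
In practice, llm_judge would wrap a prompt asking a strong text-only LLM to rate
how well the prediction matches the reference, which is one plausible way to
realize the "evaluation methods powered by large language models" described in
the abstract.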
Related papers
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia.
In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models.
This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from massive ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.
We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.
Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
- Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debates [0.0]
We propose a framework that interprets large language models (LLMs) as advocates within an ensemble of interacting agents.
This approach offers a more dynamic and comprehensive evaluation process compared to traditional human-based assessments or automated metrics.
arXiv Detail & Related papers (2024-10-07T00:22:07Z)
- N-gram Prediction and Word Difference Representations for Language Modeling [0.0]
We introduce a simple N-gram prediction framework for the Causal Language Model (CLM) task.
We also introduce word difference representation (WDR) as a surrogate and contextualized target representation during model training.
To further enhance the quality of next word prediction, we propose an ensemble method that incorporates the future N words' prediction results.
arXiv Detail & Related papers (2024-09-05T07:03:23Z)
- Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models [58.58594658683919]
Large multimodal models (LMMs) have shown transformative potential across various research tasks.
Our findings indicate LMMs possess advantages in zero-shot learning, interpretability, and handling uncurated 'in-the-wild' inputs.
We propose a Chain-of-Thought augmented prompting approach, which effectively mitigates the off-target prediction issue.
arXiv Detail & Related papers (2024-05-24T16:26:56Z)
- Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models [24.445829787297658]
Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications.
This study aims to scrutinize the validity of such probability-based evaluation methods within the context of using LLMs for Multiple Choice Questions (MCQs).
Our empirical investigation reveals that the prevalent probability-based evaluation method inadequately aligns with generation-based prediction.
arXiv Detail & Related papers (2024-02-21T15:58:37Z)
- MMBench: Is Your Multi-modal Model an All-around Player? [114.45702807380415]
We propose MMBench, a benchmark for assessing the multi-modal capabilities of vision-language models.
MMBench is meticulously curated with well-designed quality control schemes.
MMBench incorporates multiple-choice questions in both English and Chinese versions.
arXiv Detail & Related papers (2023-07-12T16:23:09Z)
- ThinkSum: Probabilistic reasoning over sets using large language models [18.123895485602244]
We propose a two-stage probabilistic inference paradigm, ThinkSum, which reasons over sets of objects or facts in a structured manner.
We demonstrate the possibilities and advantages of ThinkSum on the BIG-bench suite of LLM evaluation tasks.
arXiv Detail & Related papers (2022-10-04T00:34:01Z)
- Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences.
In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning.
arXiv Detail & Related papers (2022-05-28T01:03:30Z)