Mementos: A Comprehensive Benchmark for Multimodal Large Language Model
Reasoning over Image Sequences
- URL: http://arxiv.org/abs/2401.10529v2
- Date: Thu, 25 Jan 2024 04:11:57 GMT
- Title: Mementos: A Comprehensive Benchmark for Multimodal Large Language Model
Reasoning over Image Sequences
- Authors: Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong
He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong
Huang
- Abstract summary: This paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities.
We find that MLLMs struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects.
- Score: 80.54979242912944
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have demonstrated proficiency in
handling a variety of visual-language tasks. However, current MLLM benchmarks
are predominantly designed to evaluate reasoning based on static information
about a single image, and the ability of modern MLLMs to extrapolate from image
sequences, which is essential for understanding our ever-changing world, has
been less investigated. To address this challenge, this paper introduces
Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning
abilities. Mementos features 4,761 diverse image sequences with varying
lengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning
performance. Through a careful evaluation of nine recent MLLMs on Mementos,
including GPT-4V and Gemini, we find that they struggle to accurately describe
dynamic information about given image sequences, often leading to
hallucinations/misrepresentations of objects and their corresponding behaviors.
Our quantitative analysis and case studies identify three key factors impacting
MLLMs' sequential image reasoning: the correlation between object and
behavioral hallucinations, the influence of cooccurring behaviors, and the
compounding impact of behavioral hallucinations. Our dataset is available at
https://github.com/umd-huang-lab/Mementos.
Related papers
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z) - MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation [50.73561815838431]
Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena.
We propose a novel dynamic correction decoding method for MLLMs (DeCo)
We evaluate DeCo on widely-used benchmarks, demonstrating that it can reduce hallucination rates by a large margin compared to baselines.
arXiv Detail & Related papers (2024-10-15T16:57:44Z) - MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era [72.95901753186227]
Multi-Modal Relation Understanding (MMRel) is a comprehensive dataset for studying inter-object relations with Multi-modal Large Language Models (MLLMs)
MMRel features three distinctive attributes: (i) It includes over 15K question-answer pairs, which are sourced from three distinct domains, ensuring large scale and high diversity; (ii) It contains a subset featuring highly unusual relations, on which MLLMs often fail due to hallucinations, thus are very challenging; (iii) It provides manually verified high-quality labels for inter-object relations.
arXiv Detail & Related papers (2024-06-13T13:51:59Z) - NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language
Models [34.91372939329467]
We introduce a benchmark, NPHardEval4V, to evaluate the pure reasoning abilities of MLLMs.
Our findings reveal significant discrepancies in reasoning abilities across different models.
We also investigate the impact of different prompting styles, including visual, text, and combined visual and text prompts, on the reasoning abilities of MLLMs.
arXiv Detail & Related papers (2024-03-04T07:10:31Z) - Exploring Perceptual Limitation of Multimodal Large Language Models [57.567868157293994]
We quantitatively study the perception of small visual objects in several state-of-the-art MLLMs.
We identify four independent factors that can contribute to this limitation.
Lower object quality and smaller object size can both independently reduce MLLMs' ability to answer visual questions.
arXiv Detail & Related papers (2024-02-12T03:04:42Z) - Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs [71.07108539262721]
We design benchmark settings to emulate human language responses related to low-level vision.
We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs.
We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than humans.
arXiv Detail & Related papers (2024-02-11T06:44:11Z) - The Instinctive Bias: Spurious Images lead to Illusion in MLLMs [34.91795817316696]
We identify a typical class of inputs that baffles MLLMs, which consist of images that are highly relevant but inconsistent with answers.
We propose CorrelationQA, the first benchmark that assesses the visual illusion level given spurious images.
We conduct a thorough analysis on 9 mainstream MLLMs, illustrating that they universally suffer from this instinctive bias to varying degrees.
arXiv Detail & Related papers (2024-02-06T06:48:46Z) - Temporal Insight Enhancement: Mitigating Temporal Hallucination in
Multimodal Large Language Models [20.33971942003996]
This study introduces an innovative method to address event-level hallucinations in MLLMs.
We propose a unique mechanism that decomposes on-demand event queries into iconic actions.
We employ models like CLIP and BLIP2 to predict specific timestamps for event occurrences.
arXiv Detail & Related papers (2024-01-18T10:18:48Z) - Towards Perceiving Small Visual Details in Zero-shot Visual Question
Answering with Multimodal LLMs [12.598351373932234]
We investigate whether MLLMs can perceive small details as well as large details in images.
We show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question.
We propose five automatic visual cropping methods to improve the zero-shot performance of MLLMs.
arXiv Detail & Related papers (2023-10-24T17:48:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.