Related papers: Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

URL: http://arxiv.org/abs/2401.10529v2
Date: Thu, 25 Jan 2024 04:11:57 GMT
Title: Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
Authors: Xiyao Wang, Yuhang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong Huang
Abstract summary: This paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities. We find that MLLMs struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects.
Score: 80.54979242912944
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated proficiency in handling a variety of visual-language tasks. However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated. To address this challenge, this paper introduces Mementos, a new benchmark designed to assess MLLMs' sequential image reasoning abilities. Mementos features 4,761 diverse image sequences with varying lengths. We also employ a GPT-4 assisted method to evaluate MLLM reasoning performance. Through a careful evaluation of nine recent MLLMs on Mementos, including GPT-4V and Gemini, we find that they struggle to accurately describe dynamic information about given image sequences, often leading to hallucinations/misrepresentations of objects and their corresponding behaviors. Our quantitative analysis and case studies identify three key factors impacting MLLMs' sequential image reasoning: the correlation between object and behavioral hallucinations, the influence of cooccurring behaviors, and the compounding impact of behavioral hallucinations. Our dataset is available at https://github.com/umd-huang-lab/Mementos.

Related papers

Grounded Chain-of-Thought for Multimodal Large Language Models [66.04061083611863]
We propose a new learning task for multimodal large language models (MLLMs) called Grounded Chain-of-Thought (GCoT). GCoT aims to help MLLMs recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as the intuitive basis. To facilitate this task, we also carefully design and construct a dataset called multimodal grounded chain-of-thought (MM-GCoT), consisting of 24,022 GCoT examples for 5,033 images.
arXiv Detail & Related papers (2025-03-17T04:07:47Z)
VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage [50.84150600032693]
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. We propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions. Our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V.
arXiv Detail & Related papers (2024-12-20T01:37:22Z)
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding. It aims to localize instances of interest across multiple images based on open-ended text prompts. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z)
MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation [50.73561815838431]
Multimodal Large Language Models (MLLMs) frequently exhibit hallucination phenomena. We propose a novel dynamic correction decoding method for MLLMs (DeCo). We evaluate DeCo on widely-used benchmarks, demonstrating that it can reduce hallucination rates by a large margin compared to baselines.
arXiv Detail & Related papers (2024-10-15T16:57:44Z)
NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models [34.91372939329467]
We introduce a benchmark, NPHardEval4V, to evaluate the pure reasoning abilities of MLLMs. Our findings reveal significant discrepancies in reasoning abilities across different models. We also investigate the impact of different prompting styles, including visual, text, and combined visual and text prompts, on the reasoning abilities of MLLMs.
arXiv Detail & Related papers (2024-03-04T07:10:31Z)
Exploring Perceptual Limitation of Multimodal Large Language Models [57.567868157293994]
We quantitatively study the perception of small visual objects in several state-of-the-art MLLMs. We identify four independent factors that can contribute to this limitation. Lower object quality and smaller object size can both independently reduce MLLMs' ability to answer visual questions.
arXiv Detail & Related papers (2024-02-12T03:04:42Z)
Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs [71.07108539262721]
We design benchmark settings to emulate human language responses related to low-level vision. We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs. We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than humans.
arXiv Detail & Related papers (2024-02-11T06:44:11Z)
The Instinctive Bias: Spurious Images lead to Illusion in MLLMs [34.91795817316696]
We identify a typical class of inputs that baffles MLLMs: images that are highly relevant to, but inconsistent with, the answers. We propose CorrelationQA, the first benchmark that assesses the visual illusion level given spurious images. We conduct a thorough analysis on 9 mainstream MLLMs, illustrating that they universally suffer from this instinctive bias to varying degrees.
arXiv Detail & Related papers (2024-02-06T06:48:46Z)
Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models [20.33971942003996]
This study introduces an innovative method to address event-level hallucinations in MLLMs. We propose a unique mechanism that decomposes on-demand event queries into iconic actions. We employ models like CLIP and BLIP2 to predict specific timestamps for event occurrences.
arXiv Detail & Related papers (2024-01-18T10:18:48Z)
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks. MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs [12.598351373932234]
We investigate whether MLLMs can perceive small details as well as large details in images. We show that their zero-shot accuracy in answering visual questions is very sensitive to the size of the visual subject of the question. We propose five automatic visual cropping methods to improve the zero-shot performance of MLLMs.
arXiv Detail & Related papers (2023-10-24T17:48:04Z)
