MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning
- URL: http://arxiv.org/abs/2404.13591v2
- Date: Wed, 24 Apr 2024 22:32:10 GMT
- Title: MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning
- Authors: Yifan Jiang, Jiarui Zhang, Kexuan Sun, Zhivar Sourati, Kian Ahrabian, Kaixin Ma, Filip Ilievski, Jay Pujara,
- Abstract summary: We evaluate whether multi-modal large language models (MLLMs) possess abstract visual reasoning abilities.
Similar to the Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns.
We introduce MARVEL, a benchmark with 770 MLLMs composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations.
- Score: 22.440669015518015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While multi-modal large language models (MLLMs) have shown significant progress on many popular visual reasoning benchmarks, whether they possess abstract visual reasoning abilities remains an open question. Similar to the Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns (e.g., repetition constraints) that control the input shapes (e.g., digits) in a specific task configuration (e.g., matrix). However, existing AVR benchmarks only considered a limited set of patterns (addition, conjunction), input shapes (rectangle, square), and task configurations (3 by 3 matrices). To evaluate MLLMs' reasoning abilities comprehensively, we introduce MARVEL, a multidimensional AVR benchmark with 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations. To inspect whether the model accuracy is grounded in perception and reasoning, MARVEL complements the general AVR question with perception questions in a hierarchical evaluation framework. We conduct comprehensive experiments on MARVEL with nine representative MLLMs in zero-shot and few-shot settings. Our experiments reveal that all models show near-random performance on the AVR question, with significant performance gaps (40%) compared to humans across all patterns and task configurations. Further analysis of perception questions reveals that MLLMs struggle to comprehend the visual features (near-random performance) and even count the panels in the puzzle ( <45%), hindering their ability for abstract reasoning. We release our entire code and dataset.
Related papers
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - What is the Visual Cognition Gap between Humans and Multimodal LLMs? [22.99627171182423]
Multimodal Large Language Models (MLLMs) have shown great promise in language-guided tasks such as recognition, segmentation, and object detection.
One such challenge is abstract visual reasoning (AVR) -- the cognitive ability to discern relationships among patterns in a set of images and extrapolate to predict subsequent patterns.
We propose new dataset MaRs-VQA and a new benchmark VCog-Bench to evaluate the zero-shot capability of MLLMs.
arXiv Detail & Related papers (2024-06-14T22:02:21Z) - Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning [40.972648044298374]
Multi-Modal Large Language Models (MLLMs) have demonstrated impressive performance in various VQA tasks.
They often lack interpretability and struggle with complex visual inputs.
We propose a multi-turn processing pipeline that dynamically focuses on visual inputs and provides interpretable thoughts.
arXiv Detail & Related papers (2024-03-25T17:59:23Z) - Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z) - SHIELD : An Evaluation Benchmark for Face Spoofing and Forgery Detection
with Multimodal Large Language Models [63.946809247201905]
We introduce a new benchmark, namely SHIELD, to evaluate the ability of MLLMs on face spoofing and forgery detection.
We design true/false and multiple-choice questions to evaluate multimodal face data in these two face security tasks.
The results indicate that MLLMs hold substantial potential in the face security domain.
arXiv Detail & Related papers (2024-02-06T17:31:36Z) - REBUS: A Robust Evaluation Benchmark of Understanding Symbols [1.90463290938268]
GPT-4o significantly outperforms all other models, followed by proprietary models outperforming all other evaluated models.
Even the best model has a final accuracy of only 42%, which goes down to just 7% on hard puzzles.
Our benchmark can therefore be used to identify major shortcomings in the knowledge and reasoning of multimodal large language models.
arXiv Detail & Related papers (2024-01-11T00:30:28Z) - Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can result in a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, 6.41%, and 7.94% points increase on A-OKVQA, and VizWiz respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z) - Learning Abstract Visual Reasoning via Task Decomposition: A Case Study
in Raven Progressive Matrices [0.24475591916185496]
In Raven Progressive Matrices, the task is to choose one of the available answers given a context.
In this study, we propose a deep learning architecture based on the transformer blueprint.
The multidimensional predictions obtained in this way are then directly juxtaposed to choose the answer.
arXiv Detail & Related papers (2023-08-12T11:02:21Z) - See, Think, Confirm: Interactive Prompting Between Vision and Language
Models for Knowledge-based Visual Reasoning [60.43585179885355]
We propose a novel framework named Interactive Prompting Visual Reasoner (IPVR) for few-shot knowledge-based visual reasoning.
IPVR contains three stages, see, think and confirm.
We conduct experiments on a range of knowledge-based visual reasoning datasets.
arXiv Detail & Related papers (2023-01-12T18:59:50Z) - ASOD60K: Audio-Induced Salient Object Detection in Panoramic Videos [79.05486554647918]
We propose PV-SOD, a new task that aims to segment salient objects from panoramic videos.
In contrast to existing fixation-level or object-level saliency detection tasks, we focus on multi-modal salient object detection (SOD)
We collect the first large-scale dataset, named ASOD60K, which contains 4K-resolution video frames annotated with a six-level hierarchy.
arXiv Detail & Related papers (2021-07-24T15:14:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.