FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts
- URL: http://arxiv.org/abs/2406.19237v2
- Date: Fri, 28 Jun 2024 05:43:46 GMT
- Title: FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts
- Authors: Shubhankar Singh, Purvi Chaurasia, Yerram Varun, Pranshu Pandya, Vatsal Gupta, Vivek Gupta, Dan Roth
- Abstract summary: FlowVQA is a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts.
We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies.
The results underscore the benchmark's potential as a vital tool for advancing the field of multimodal modeling.
- Score: 41.84175991112392
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing benchmarks for visual question answering lack in visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart images from three distinct content sources, along with 22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks, including information localization, decision-making, and logical progression. We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies, followed by an analysis of directional bias. The results underscore the benchmark's potential as a vital tool for advancing the field of multimodal modeling, providing a focused and challenging environment for enhancing model performance in visual and logical reasoning tasks.
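The benchmark's structure, as described above, is essentially a set of flowchart images, each paired with question-answer pairs tagged by reasoning type. A minimal sketch of how such a dataset might be represented and scored follows; the field names, the placeholder source label, and the exact-match metric are illustrative assumptions, not the released FlowVQA format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record layout for a flowchart-grounded VQA benchmark.
# Field names are illustrative guesses, not the official FlowVQA schema.
@dataclass
class QAPair:
    question: str
    answer: str
    question_type: str  # e.g. "information localization", "decision-making", "logical progression"

@dataclass
class FlowchartSample:
    image_path: str                                 # rendered flowchart image
    source: str                                     # which content source it came from
    qa_pairs: List[QAPair] = field(default_factory=list)

def exact_match_accuracy(samples, predict):
    """Score a callable `predict(image_path, question) -> str` by exact string match."""
    correct, total = 0, 0
    for sample in samples:
        for qa in sample.qa_pairs:
            pred = predict(sample.image_path, qa.question)
            correct += int(pred.strip().lower() == qa.answer.strip().lower())
            total += 1
    return correct / max(total, 1)

if __name__ == "__main__":
    demo = [FlowchartSample(
        image_path="flowchart_0001.png",
        source="source_1",  # placeholder; the real dataset names its three sources
        qa_pairs=[QAPair("Which step follows 'validate input'?", "log the error", "logical progression")],
    )]
    print(exact_match_accuracy(demo, lambda img, q: "log the error"))  # 1.0
```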
Related papers
- VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories.
These questions are designed to assess the visual reasoning capabilities of MLLMs from multiple perspectives.
Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z)
- VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering [8.21219588747224]
This paper introduces Vision-Language Multimodal Transformer (VLMT), a unified architecture that integrates a vision encoder with a sequence-to-sequence language model.
VLMT employs a direct token-level injection mechanism to fuse visual and textual inputs within a shared embedding space.
Comprehensive experiments on two benchmark datasets demonstrate the effectiveness of the proposed approach.
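The "direct token-level injection" mentioned above generally amounts to projecting vision features into the language model's embedding space and placing them in the same token sequence as the text embeddings. The snippet below is a generic sketch of that idea, not the actual VLMT code; the dimensions and the single linear projection are assumptions.

```python
import torch
import torch.nn as nn

class TokenLevelInjection(nn.Module):
    """Toy sketch of fusing vision features with text embeddings in one sequence.

    A generic illustration of token-level injection, not the VLMT implementation;
    the single linear projection and the dimensions are assumptions."""

    def __init__(self, vision_dim: int = 768, text_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)  # map vision tokens into the text embedding space

    def forward(self, vision_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        # text_embeds:  (batch, num_text_tokens, text_dim), already embedded by the language model
        injected = self.proj(vision_feats)            # (batch, num_patches, text_dim)
        return torch.cat([injected, text_embeds], 1)  # one shared token sequence

if __name__ == "__main__":
    fuse = TokenLevelInjection()
    out = fuse(torch.randn(2, 49, 768), torch.randn(2, 16, 1024))
    print(out.shape)  # torch.Size([2, 65, 1024])
```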
arXiv Detail & Related papers (2025-04-11T05:51:44Z)
- Can Large Language Models Unveil the Mysteries? An Exploration of Their Ability to Unlock Information in Complex Scenarios [9.761316172913016]
We explore the ability of advanced models to integrate multiple inputs for reasoning in complex scenarios.
We present three plug-and-play approaches: utilizing model input for reasoning, enhancing reasoning through minimum margin decoding, and retrieving semantically relevant data.
Our approach improves the performance of models on reasoning, with a 22.17% boost on CVQA over the SOTA closed-source model.
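The summary does not spell out how "minimum margin decoding" is defined, so the sketch below only shows the usual building block behind margin-based decoding heuristics: the gap between the top-1 and top-2 token probabilities at each decoding step, which can be used to flag uncertain steps for re-ranking or re-decoding. The threshold and the flagging rule are illustrative assumptions.

```python
import torch

def token_margins(logits: torch.Tensor) -> torch.Tensor:
    """Per-step confidence margin: p(top-1 token) - p(top-2 token).

    `logits` has shape (seq_len, vocab_size). This is a generic building block
    for margin-based decoding heuristics, not the paper's exact procedure."""
    probs = torch.softmax(logits, dim=-1)
    top2 = probs.topk(2, dim=-1).values  # (seq_len, 2)
    return top2[:, 0] - top2[:, 1]

def flag_uncertain_steps(logits: torch.Tensor, threshold: float = 0.1) -> torch.Tensor:
    """Boolean mask of decoding steps whose margin falls below `threshold`,
    i.e. the steps a margin-aware decoder might revisit or re-rank."""
    return token_margins(logits) < threshold

if __name__ == "__main__":
    fake_logits = torch.randn(8, 32000)  # 8 decoding steps over a 32k-token vocabulary
    print(flag_uncertain_steps(fake_logits))
```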
arXiv Detail & Related papers (2025-02-27T10:58:27Z)
- VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.
We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.
We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z)
- How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model? [2.3993515715868714]
We propose a novel, generalizable methodology to identify preferred image distributions for Vision-Language Models (VLMs).
Applying this to different rendering types of 3D objects, we demonstrate its efficacy across various domains requiring precise interpretation of complex structures.
To address the lack of benchmarks in specialized domains, we introduce CAD-VQA, a new dataset for evaluating VLMs on CAD-related visual question answering tasks.
arXiv Detail & Related papers (2024-09-03T19:26:13Z)
- On Pre-training of Multimodal Language Models Customized for Chart Understanding [83.99377088129282]
This paper explores the training processes necessary to improve MLLMs' comprehension of charts.
We introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
- mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning [8.1113308714581]
This paper introduces a novel multimodal chart question-answering model.
Our model integrates visual and linguistic processing, overcoming the constraints of existing methods.
This approach has demonstrated superior performance on multiple public datasets.
arXiv Detail & Related papers (2024-04-02T01:28:44Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps visual features to probability distributions over a Large Multi-modal Model's vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
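A common way to realize the "visual token" idea described above is to score each visual feature against the language model's token embedding table and normalize the scores into a distribution over the vocabulary. The sketch below illustrates that general mechanism; the dot-product similarity and temperature are assumptions rather than the paper's exact formulation.

```python
import torch

def visual_features_to_vocab_dist(visual_feats: torch.Tensor,
                                  token_embeddings: torch.Tensor,
                                  temperature: float = 1.0) -> torch.Tensor:
    """Map visual features to probability distributions over a language model's vocabulary.

    A generic sketch of the 'visual token' idea: score each visual feature against
    every text token embedding, then normalize. Similarity choice and temperature
    are assumptions, not the paper's formulation.

    visual_feats:     (num_visual_tokens, hidden_dim), already projected into the LM space
    token_embeddings: (vocab_size, hidden_dim), the LM's input embedding table
    returns:          (num_visual_tokens, vocab_size), each row summing to 1
    """
    scores = visual_feats @ token_embeddings.T / temperature
    return torch.softmax(scores, dim=-1)

if __name__ == "__main__":
    dist = visual_features_to_vocab_dist(torch.randn(49, 512), torch.randn(1000, 512))
    print(dist.shape, dist.sum(dim=-1)[:3])  # rows are proper distributions
```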
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- A Novel Energy based Model Mechanism for Multi-modal Aspect-Based Sentiment Analysis [85.77557381023617]
We propose a novel framework called DQPSA for multi-modal sentiment analysis.
The PDQ module uses the prompt as both a visual query and a language query to extract prompt-aware visual information.
The EPE module models the boundary pairing of the analysis target from the perspective of an energy-based model.
arXiv Detail & Related papers (2023-12-13T12:00:46Z)
- X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning [109.9413329636322]
This paper introduces an efficient framework that integrates multiple modalities (images, 3D, audio and video) into frozen Large Language Models (LLMs).
Our approach explores two distinct projection mechanisms: Q-Formers and Linear Projections (LPs).
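Of the two projection mechanisms named above, the Linear Projection variant is the simpler: a small trainable linear map per modality feeds the frozen LLM's embedding space. The sketch below illustrates that general pattern; the modality names and dimensions are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LinearModalityProjector(nn.Module):
    """Minimal sketch of the 'Linear Projection (LP)' alignment idea: one small
    trainable linear map per modality feeds a frozen language model's embedding
    space. Modality names and dimensions are illustrative placeholders."""

    def __init__(self, modality_dims: dict, llm_dim: int = 2048):
        super().__init__()
        self.projectors = nn.ModuleDict(
            {name: nn.Linear(dim, llm_dim) for name, dim in modality_dims.items()}
        )

    def forward(self, name: str, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_tokens, modality_dim) from a frozen modality encoder
        return self.projectors[name](feats)  # aligned to the frozen LLM's embedding width

if __name__ == "__main__":
    proj = LinearModalityProjector({"image": 1024, "audio": 768, "pointcloud": 512})
    aligned = proj("audio", torch.randn(2, 32, 768))
    print(aligned.shape)  # torch.Size([2, 32, 2048])
```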
arXiv Detail & Related papers (2023-11-30T18:43:51Z)
- LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
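As a rough illustration of relating instance-level features to background context, the sketch below uses cross-attention in both directions. It conveys the general idea of a mutual relation module only; the actual LOIS architecture and its dimensions are not reproduced here.

```python
import torch
import torch.nn as nn

class MutualRelationAttention(nn.Module):
    """Toy sketch of relating instance features and background features with
    cross-attention in both directions. A generic illustration, not the LOIS module."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.inst_to_bg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bg_to_inst = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, instances: torch.Tensor, background: torch.Tensor):
        # instances:  (batch, num_instances, dim)
        # background: (batch, num_bg_regions, dim)
        inst_ctx, _ = self.inst_to_bg(instances, background, background)  # instances attend to background
        bg_ctx, _ = self.bg_to_inst(background, instances, instances)     # background attends to instances
        return instances + inst_ctx, background + bg_ctx                  # residual updates

if __name__ == "__main__":
    m = MutualRelationAttention()
    i, b = m(torch.randn(2, 10, 256), torch.randn(2, 36, 256))
    print(i.shape, b.shape)
```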
arXiv Detail & Related papers (2023-07-26T12:13:00Z)
- Accuracy vs. Complexity: A Trade-off in Visual Question Answering Models [39.338304913058685]
We study the trade-off between the model complexity and the performance on the Visual Question Answering task.
We focus on the effect of "multi-modal fusion" in VQA models, which is typically the most expensive step in a VQA pipeline.
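To make the fusion-cost trade-off concrete, the snippet below contrasts two textbook fusion operators for an image vector and a question vector: an element-wise product followed by a linear layer versus a full bilinear interaction, whose parameter count grows cubically with the hidden size. These are generic baselines for illustration, not the specific models evaluated in the paper.

```python
import torch
import torch.nn as nn

def param_count(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

# Two illustrative fusion operators for combining an image vector and a question
# vector in a VQA pipeline; dimensions are arbitrary and chosen for the demo.
dim = 256

cheap_fusion = nn.Sequential(nn.Linear(dim, dim))  # applied to v * q (element-wise product)
heavy_fusion = nn.Bilinear(dim, dim, dim)          # full bilinear interaction, ~dim^3 weights

v, q = torch.randn(8, dim), torch.randn(8, dim)
cheap_out = cheap_fusion(v * q)
heavy_out = heavy_fusion(v, q)

print("element-wise product + linear:", param_count(cheap_fusion), "parameters")
print("bilinear fusion:              ", param_count(heavy_fusion), "parameters")
```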
arXiv Detail & Related papers (2020-01-20T11:27:21Z)