Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework
- URL: http://arxiv.org/abs/2502.14864v1
- Date: Thu, 20 Feb 2025 18:59:42 GMT
- Title: Benchmarking Multimodal RAG through a Chart-based Document Question-Answering Generation Framework
- Authors: Yuming Yang, Jiang Zhong, Li Jin, Jingwang Huang, Jingpeng Gao, Qing Liu, Yang Bai, Jingyuan Zhang, Rui Jiang, Kaiwen Wei
- Abstract summary: Multimodal Retrieval-Augmented Generation (MRAG) enhances reasoning capabilities by integrating external knowledge. Existing benchmarks primarily focus on simple image-text interactions, overlooking complex visual formats like charts that are prevalent in real-world applications. We propose CHARt-based document question-answering GEneration (CHARGE), a framework that produces evaluation data through structured keypoint extraction, crossmodal verification, and keypoint-based generation.
- Score: 17.838177710655287
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Retrieval-Augmented Generation (MRAG) enhances reasoning capabilities by integrating external knowledge. However, existing benchmarks primarily focus on simple image-text interactions, overlooking complex visual formats like charts that are prevalent in real-world applications. In this work, we introduce a novel task, Chart-based MRAG, to address this limitation. To semi-automatically generate high-quality evaluation samples, we propose CHARt-based document question-answering GEneration (CHARGE), a framework that produces evaluation data through structured keypoint extraction, crossmodal verification, and keypoint-based generation. By combining CHARGE with expert validation, we construct Chart-MRAG Bench, a comprehensive benchmark for chart-based MRAG evaluation, featuring 4,738 question-answering pairs across 8 domains from real-world documents. Our evaluation reveals three critical limitations in current approaches: (1) unified multimodal embedding retrieval methods struggle in chart-based scenarios, (2) even with ground-truth retrieval, state-of-the-art MLLMs achieve only 58.19% Correctness and 73.87% Coverage scores, and (3) MLLMs demonstrate a consistent text-over-visual modality bias during chart-based MRAG reasoning. CHARGE and Chart-MRAG Bench are released at https://github.com/Nomothings/CHARGE.git.
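To make the framework concrete, below is a minimal Python sketch of the three CHARGE stages named in the abstract. The `mllm` callable and all prompts are illustrative assumptions, not the released implementation (see the repository above for that).

```python
# Hypothetical sketch of the three CHARGE stages; `mllm` is an assumed
# callable (prompt, image) -> str standing in for any multimodal LLM.
def charge_pipeline(chart_image, doc_text, mllm):
    # Stage 1: structured keypoint extraction -- distill candidate facts
    # grounded in the chart, the text, or both.
    raw = mllm("List the key factual points in this chart and passage, "
               "one per line:\n" + doc_text, chart_image)
    keypoints = [line.strip("- ") for line in raw.splitlines() if line.strip()]

    # Stage 2: crossmodal verification -- keep only keypoints the model
    # confirms against BOTH the chart image and the document text.
    verified = [
        kp for kp in keypoints
        if "yes" in mllm("Is this point supported by both the chart and the "
                         f"text? Answer yes or no: {kp}\n\nText: {doc_text}",
                         chart_image).lower()
    ]

    # Stage 3: keypoint-based generation -- turn each verified keypoint into
    # a QA pair; expert validation happens downstream of this sketch.
    return [
        {"question": mllm(f"Write a question answered by: {kp}", chart_image),
         "answer": kp}
        for kp in verified
    ]
```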
Related papers
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling. It comprises four distinct small-scale agents with clearly defined roles and effective collaboration. It shows superior performance at a smaller parameter scale without sacrificing performance on general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z)
- ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering [14.468507852394923]
Chart question answering (CQA) has become a critical multimodal task for evaluating the reasoning capabilities of vision-language models. We introduce ChartMind, a new benchmark designed for complex CQA tasks in real-world settings. We propose a context-aware yet model-agnostic framework, ChartLLM, that focuses on extracting key contextual elements.
arXiv Detail & Related papers (2025-05-29T08:46:03Z)
- mmRAG: A Modular Benchmark for Retrieval-Augmented Generation over Text, Tables, and Knowledge Graphs [11.861763118322136]
We introduce mmRAG, a modular benchmark for evaluating multi-modal RAG systems. Our benchmark integrates queries from six diverse question-answering datasets spanning text, tables, and knowledge graphs. We follow standard information retrieval procedures to annotate document relevance and derive dataset relevance.
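One plausible reading of the relevance-derivation step, as a toy sketch (the max-pooling rule and all names are assumptions, not the paper's stated procedure):

```python
# Assumed rule: a source dataset inherits the highest relevance grade of
# any of its documents for a given query.
def derive_dataset_relevance(doc_relevance: dict[str, int],
                             doc_to_dataset: dict[str, str]) -> dict[str, int]:
    dataset_rel: dict[str, int] = {}
    for doc_id, rel in doc_relevance.items():
        ds = doc_to_dataset[doc_id]
        dataset_rel[ds] = max(dataset_rel.get(ds, 0), rel)  # graded max-pooling
    return dataset_rel

# Example: two text docs and one table doc judged for a single query.
print(derive_dataset_relevance(
    {"nq_001": 2, "nq_002": 0, "tabfact_007": 1},
    {"nq_001": "nq", "nq_002": "nq", "tabfact_007": "tabfact"},
))  # {'nq': 2, 'tabfact': 1}
```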
arXiv Detail & Related papers (2025-05-16T12:31:29Z)
- Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding [14.75820681491341]
Existing benchmarks reveal reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning.
We propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics representations.
Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance.
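A toy illustration of why an SVG representation helps (my simplification, not the paper's multi-agent pipeline): once chart primitives are symbolic, values can be read exactly rather than estimated from pixels. The SVG below is fabricated for the example.

```python
import xml.etree.ElementTree as ET

SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <rect class="bar" x="10" y="60" width="20" height="40"/>
  <rect class="bar" x="40" y="20" width="20" height="80"/>
</svg>"""

root = ET.fromstring(SVG)
ns = {"svg": "http://www.w3.org/2000/svg"}
# Bar heights come straight from the primitives -- no pixel estimation.
heights = [float(r.get("height")) for r in root.findall("svg:rect", ns)]
print(heights)                       # [40.0, 80.0]
print(max(heights) / min(heights))   # 2.0, an exact ratio
```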
arXiv Detail & Related papers (2025-04-14T00:07:39Z)
- RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning [63.599057862999]
RefChartQA is a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding.
Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%.
arXiv Detail & Related papers (2025-03-29T15:50:08Z)
- QuIM-RAG: Advancing Retrieval-Augmented Generation with Inverted Question Matching for Enhanced QA Performance [1.433758865948252]
This work presents a novel architecture for building Retrieval-Augmented Generation (RAG) systems. The RAG architecture is constructed to generate responses from the target document. We introduce QuIM-RAG, a novel approach for the retrieval mechanism in our system.
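A minimal sketch of inverted question matching as the abstract describes it, with hypothetical `question_generator` and `embed` callables standing in for the paper's components:

```python
# Index time: generate the questions each chunk can answer and embed them.
def build_question_index(chunks, question_generator, embed):
    return [(embed(q), chunk)
            for chunk in chunks
            for q in question_generator(chunk)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / ((norm(a) * norm(b)) or 1.0)

# Query time: match question-to-question, then return the source chunks.
def retrieve(user_query, index, embed, top_k=3):
    qv = embed(user_query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[0]), reverse=True)
    hits, seen = [], set()
    for _, chunk in ranked:
        if chunk not in seen:   # deduplicate chunks behind multiple questions
            seen.add(chunk)
            hits.append(chunk)
        if len(hits) == top_k:
            break
    return hits
```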
arXiv Detail & Related papers (2025-01-06T01:07:59Z)
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval-Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)
- Trust but Verify: Programmatic VLM Evaluation in the Wild [62.14071929143684]
Programmatic VLM Evaluation (PROVE) is a new benchmarking paradigm for evaluating VLM responses to open-ended queries.
We benchmark the helpfulness-truthfulness trade-offs of a range of VLMs on PROVE, finding that very few are in fact able to achieve a good balance between the two.
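An illustrative simplification (not PROVE's actual programmatic checks): scoring a response's claims against ground-truth scene facts to expose the helpfulness-truthfulness trade-off.

```python
# Toy scorer: truthfulness = fraction of claims supported by the scene;
# helpfulness = fraction of query-relevant facts the response covers.
def prove_style_scores(claims, relevant_facts, scene_graph):
    truthful = [c for c in claims if c in scene_graph]
    truthfulness = len(truthful) / len(claims) if claims else 1.0
    helpfulness = len(set(truthful) & set(relevant_facts)) / len(relevant_facts)
    return truthfulness, helpfulness

scene = {"a red bus", "a man walking", "a wet street"}
print(prove_style_scores(
    claims=["a red bus", "a blue car"],             # one hallucinated claim
    relevant_facts=["a red bus", "a wet street"],   # what the query asked about
    scene_graph=scene,
))  # (0.5, 0.5)
```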
arXiv Detail & Related papers (2024-10-17T01:19:18Z)
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [69.4501863547618]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.
With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.
Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
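A hedged sketch of how keypoint-based versions of these three metrics could be computed; the `judge` callable (e.g., an LLM entailment check) is an assumption, and the paper's exact definitions may differ.

```python
# Each ground-truth keypoint is judged against the generated answer as
# "covered", "contradicted", or "missing"; the three metrics partition them.
def rageval_style_metrics(keypoints, answer, judge):
    labels = [judge(kp, answer) for kp in keypoints]
    n = len(keypoints)
    completeness = labels.count("covered") / n       # keypoints stated correctly
    hallucination = labels.count("contradicted") / n # keypoints mis-stated
    irrelevance = labels.count("missing") / n        # keypoints never addressed
    return completeness, hallucination, irrelevance
```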
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
- Multi-Head RAG: Solving Multi-Aspect Problems with LLMs [18.581518952488093]
Multi-Head RAG (MRAG) is a novel scheme for fetching multi-aspect documents. We show MRAG's design advantages over 18 RAG baselines, with empirical improvements of up to 20% in retrieval success ratios.
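A sketch of the multi-head idea: keeping one embedding per attention head lets each head score documents in its own space. The summed-cosine aggregation here is an assumption, not the paper's voting scheme.

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    return dot / ((norm(a) * norm(b)) or 1.0)

def mrag_retrieve(query_head_embs, doc_head_embs, top_k=3):
    """query_head_embs: list of vectors, one per attention head.
    doc_head_embs: dict of doc_id -> list of vectors in the same head order."""
    scores = {}
    for doc_id, head_embs in doc_head_embs.items():
        # Summing per-head similarities rewards documents that match the
        # query under several aspects, not just one dominant embedding.
        scores[doc_id] = sum(cosine(q, d)
                             for q, d in zip(query_head_embs, head_embs))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```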
arXiv Detail & Related papers (2024-06-07T16:59:38Z)
- Evaluating Retrieval Quality in Retrieval-Augmented Generation [21.115495457454365]
Traditional end-to-end evaluation methods are computationally expensive.
We propose eRAG, where each document in the retrieval list is individually utilized by the large language model within the RAG system.
eRAG offers significant computational advantages, improving runtime and consuming up to 50 times less GPU memory than end-to-end evaluation.
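A minimal sketch of eRAG's per-document evaluation as summarized above, with toy stand-ins for the LLM and the downstream metric:

```python
# Each retrieved document is used alone as context; the downstream task
# metric on the resulting answer becomes that document's relevance label.
def erag_scores(query, gold_answer, retrieved_docs, llm, metric):
    scores = []
    for doc in retrieved_docs:
        answer = llm(f"Context: {doc}\nQuestion: {query}\nAnswer:")
        scores.append(metric(answer, gold_answer))  # e.g., exact match or F1
    return scores  # per-document labels; aggregate (mean, MRR, ...) for the list

# Usage with toy callables (assumed, for illustration only):
llm = lambda prompt: "paris" if "Paris" in prompt else "unknown"
metric = lambda pred, gold: float(pred.strip().lower() == gold.lower())
print(erag_scores("Capital of France?", "Paris",
                  ["France's capital is Paris.", "Rivers of Egypt"],
                  llm, metric))  # [1.0, 0.0]
```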
arXiv Detail & Related papers (2024-04-21T21:22:28Z)
- ChartBench: A Benchmark for Complex Visual Reasoning in Charts [36.492851648081405]
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in image understanding and generation.
Current benchmarks fail to accurately evaluate the chart comprehension of MLLMs due to limited chart types and inappropriate metrics.
We propose ChartBench, a comprehensive benchmark designed to assess chart comprehension and data reliability through complex visual reasoning.
arXiv Detail & Related papers (2023-12-26T07:20:55Z)
- MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning [48.63002688222462]
A gap remains in the domain of chart image understanding due to the distinct abstract components in charts.
We introduce a large-scale MultiModal Chart Instruction dataset comprising 600k instances supporting diverse tasks and chart types.
We develop the MultiModal Chart Assistant (MMC-A), an LMM that achieves state-of-the-art performance on existing chart QA benchmarks.
arXiv Detail & Related papers (2023-11-15T23:36:42Z)
- M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.