MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers
- URL: http://arxiv.org/abs/2503.16856v1
- Date: Fri, 21 Mar 2025 05:02:20 GMT
- Title: MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers
- Authors: Yang Tian, Zheng Lu, Mingqi Gao, Zheng Liu, Bo Zhao
- Abstract summary: This work presents MMCR, a benchmark designed to evaluate Vision-Language Models' capacity for reasoning with cross-source information from scientific papers. Experiments with 18 VLMs demonstrate that cross-source reasoning presents a substantial challenge for existing models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fully comprehending scientific papers is a mark of high-level Artificial General Intelligence: it requires reasoning across fragmented and heterogeneous sources of information, a complex and practically significant challenge. While Vision-Language Models (VLMs) have made remarkable strides in various tasks, particularly those involving reasoning over evidence sourced from a single image or text page, their ability to use cross-source information for reasoning remains an open problem. This work presents MMCR, a high-difficulty benchmark designed to evaluate VLMs' capacity for reasoning with cross-source information from scientific papers. The benchmark comprises 276 high-quality questions, meticulously annotated by humans across 7 subjects and 10 task types. Experiments with 18 VLMs demonstrate that cross-source reasoning presents a substantial challenge for existing models. Notably, even the top-performing model, GPT-4o, achieved only 48.55% overall accuracy, with only 20% accuracy on multi-table comprehension tasks, while the second-best model, Qwen2.5-VL-72B, reached 39.86% overall accuracy. Furthermore, we investigated the impact of the Chain-of-Thought (CoT) technique on cross-source reasoning and observed a detrimental effect on small models, whereas larger models showed substantially enhanced performance. These results highlight the pressing need to develop VLMs capable of effectively utilizing cross-source information for reasoning.
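The overall accuracy figures above are simply the fraction of the 276 questions answered correctly, and the CoT comparison amounts to scoring the same questions under two prompting styles. Below is a minimal sketch of such an evaluation loop, not the authors' harness: the `query_vlm` wrapper and the question-record fields (`images`, `question`, `answer`) are hypothetical placeholders for whatever VLM client and data format are actually used.

```python
# Minimal sketch (assumed, not the MMCR authors' code): compare direct vs.
# Chain-of-Thought prompting accuracy on cross-source questions.
from typing import Callable


def query_vlm(images: list[str], prompt: str) -> str:
    """Hypothetical VLM call; returns the model's answer as plain text."""
    raise NotImplementedError("plug in a real VLM client here")


def accuracy(questions: list[dict], prompt_fn: Callable[[dict], str]) -> float:
    """Fraction of questions whose answer matches the reference exactly."""
    correct = 0
    for q in questions:
        answer = query_vlm(q["images"], prompt_fn(q))
        correct += answer.strip().lower() == q["answer"].strip().lower()
    return correct / len(questions)


def direct_prompt(q: dict) -> str:
    # Ask for the final result only (no intermediate reasoning).
    return f"{q['question']}\nAnswer with the final result only."


def cot_prompt(q: dict) -> str:
    # Elicit step-by-step reasoning before the final answer.
    return f"{q['question']}\nThink step by step, then state the final answer."


# Usage (with a real client and question records):
#   questions = [{"images": [...], "question": "...", "answer": "..."}, ...]
#   print(accuracy(questions, direct_prompt), accuracy(questions, cot_prompt))
```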
Related papers
- FactGuard: Leveraging Multi-Agent Systems to Generate Answerable and Unanswerable Questions for Enhanced Long-Context LLM Extraction [25.00896070082754]
Extractive reading comprehension systems are designed to locate the correct answer to a question within a given text.
A persistent challenge lies in ensuring these models maintain high accuracy in answering questions while reliably recognizing unanswerable queries.
We propose an innovative data augmentation methodology grounded in a multi-agent collaborative framework.
arXiv Detail & Related papers (2025-04-08T01:45:16Z) - Zero-Shot Commonsense Validation and Reasoning with Large Language Models: An Evaluation on SemEval-2020 Task 4 Dataset [0.16385815610837165]
This study evaluates the performance of Large Language Models (LLMs) on the SemEval-2020 Task 4 dataset. The models are tested on two tasks: Task A (Commonsense Validation), where models determine whether a statement aligns with commonsense knowledge, and Task B (Commonsense Explanation). Results indicate that larger models outperform previous models, with LLaMA3-70B achieving the highest accuracy of 98.40% in Task A while lagging behind previous models at 93.40% in Task B.
arXiv Detail & Related papers (2025-02-19T12:40:49Z) - Breaking Focus: Contextual Distraction Curse in Large Language Models [68.4534308805202]
We investigate a critical vulnerability in Large Language Models (LLMs): models fail to maintain consistent performance on questions modified with semantically coherent but irrelevant context.
We propose an efficient tree-based search methodology to automatically generate such contextual-distraction (CDV) examples.
arXiv Detail & Related papers (2025-02-03T18:43:36Z) - ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate that significant challenges remain: even GPT-4o, the best-performing model, still trails human evaluation by around 10%.
arXiv Detail & Related papers (2024-10-06T14:59:09Z) - Law of the Weakest Link: Cross Capabilities of Large Language Models [102.91861246827797]
We show that Large Language Models (LLMs) exhibit the "Law of the Weakest Link," where cross-capability performance is significantly constrained by the weakest component.
These results highlight the underperformance of LLMs on cross-capability tasks.
arXiv Detail & Related papers (2024-09-30T05:12:01Z) - Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models [61.899791071654654]
We introduce a benchmark, Q-Spatial Bench, with 271 questions across five categories designed for quantitative spatial reasoning.
We investigate the performance of state-of-the-art vision-language models (VLMs) on this task.
We develop a zero-shot prompting technique, SpatialPrompt, that encourages VLMs to answer quantitative spatial questions using reference objects as visual cues.
arXiv Detail & Related papers (2024-09-15T16:45:42Z) - CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs [62.84082370758761]
CharXiv is a comprehensive evaluation suite involving 2,323 charts from arXiv papers.
To ensure quality, all charts and questions are handpicked, curated, and verified by human experts.
Results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model and those of open-source models.
arXiv Detail & Related papers (2024-06-26T17:50:11Z) - Multimodal Large Language Models to Support Real-World Fact-Checking [80.41047725487645]
Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information.
While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied.
We propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking.
arXiv Detail & Related papers (2024-03-06T11:32:41Z) - Never Lost in the Middle: Mastering Long-Context Question Answering with Position-Agnostic Decompositional Training [9.128501882000315]
Large language models (LLMs) struggle to locate correct information in long contexts.
This paper proposes to enhance the information searching and reflection ability of LLMs in long contexts via specially designed tasks.
Experimental results show substantial improvement on Multi-doc QA and other benchmarks, surpassing state-of-the-art models by a 13.7% absolute gain in shuffled settings.
arXiv Detail & Related papers (2023-11-15T18:42:44Z)