RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation
- URL: http://arxiv.org/abs/2511.06899v1
- Date: Mon, 10 Nov 2025 09:48:07 GMT
- Title: RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation
- Authors: Haofeng Wang, Yu Zhang
- Abstract summary: Reasoning Process Tree Score (RPTS) is a tree structure-based metric to assess reasoning processes. To validate RPTS in real-world multimodal scenarios, we construct a new benchmark, RPTS-Eval, comprising 374 images and 390 reasoning instances.
- Score: 5.080252830507515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Models (LVLMs) excel in multimodal reasoning and have shown impressive performance on various multimodal benchmarks. However, most of these benchmarks evaluate models primarily through multiple-choice or short-answer formats, which do not take the reasoning process into account. Although some benchmarks assess the reasoning process, their methods are often overly simplistic and only examine reasoning when answers are incorrect. This approach overlooks scenarios where flawed reasoning leads to correct answers. In addition, these benchmarks do not consider the impact of intermodal relationships on reasoning. To address these issues, we propose the Reasoning Process Tree Score (RPTS), a tree structure-based metric to assess reasoning processes. Specifically, we organize the reasoning steps into a reasoning tree and leverage its hierarchical information to assign weighted faithfulness scores to each reasoning step. By dynamically adjusting these weights, RPTS not only evaluates the overall correctness of the reasoning but also pinpoints where the model fails in the reasoning. To validate RPTS in real-world multimodal scenarios, we construct a new benchmark, RPTS-Eval, comprising 374 images and 390 reasoning instances. Each instance includes reliable visual-textual clues that serve as leaf nodes of the reasoning tree. Furthermore, we define three types of intermodal relationships to investigate how intermodal interactions influence the reasoning process. We evaluated representative LVLMs (e.g., GPT-4o, LLaVA-NeXT), uncovering their limitations in multimodal reasoning and highlighting the differences between open-source and closed-source commercial LVLMs. We believe that this benchmark will contribute to the advancement of research in the field of multimodal reasoning.
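The abstract describes RPTS's mechanics (a reasoning tree whose leaf nodes are visual-textual clues, per-step faithfulness scores, and hierarchically weighted aggregation) without giving a formula. Below is a minimal illustrative Python sketch, not the authors' implementation: the `ReasoningNode` structure, the 1/depth weight decay, the source of the per-step faithfulness scores, and the 0.5 failure threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningNode:
    """One step in a reasoning tree; leaves hold the visual/textual clues."""
    step: str
    faithfulness: float  # assumed per-step faithfulness in [0, 1], e.g. judge-assigned
    children: list["ReasoningNode"] = field(default_factory=list)

def weighted_steps(node: ReasoningNode, level: int = 1):
    """Pair each step with a depth-based weight.

    Assumption: steps nearer the root (the final conclusion) weigh more, so
    unfaithful late-stage steps are penalized more heavily. The paper's actual
    dynamic weighting scheme is not specified in the abstract.
    """
    yield 1.0 / level, node.faithfulness
    for child in node.children:
        yield from weighted_steps(child, level + 1)

def rpts_score(root: ReasoningNode) -> float:
    """Overall score: weighted mean faithfulness over all reasoning steps."""
    pairs = list(weighted_steps(root))
    total_weight = sum(w for w, _ in pairs)
    return sum(w * f for w, f in pairs) / total_weight

def localize_failures(node: ReasoningNode, threshold: float = 0.5, path: str = "root"):
    """Yield paths to steps below a faithfulness threshold, mirroring RPTS's
    ability to pinpoint where the model fails in the reasoning."""
    if node.faithfulness < threshold:
        yield path, node.step
    for i, child in enumerate(node.children):
        yield from localize_failures(child, threshold, f"{path}.{i}")

# Usage: a conclusion resting on one shaky inference over two reliable clues.
tree = ReasoningNode("final answer", 0.9, children=[
    ReasoningNode("intermediate inference", 0.4, children=[
        ReasoningNode("visual clue from the image", 1.0),
        ReasoningNode("textual clue from the question", 1.0),
    ]),
])
print(rpts_score(tree))               # high overall, dragged down by the weak step
print(list(localize_failures(tree)))  # [('root.0', 'intermediate inference')]
```

In this toy tree the final answer looks correct, yet the scorer still surfaces the unfaithful intermediate step, which is exactly the failure mode the abstract says answer-only evaluation misses.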
Related papers
- Multimodal Fact-Level Attribution for Verifiable Reasoning [80.60864342985748]
Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation. Existing multimodal grounding benchmarks and evaluation methods fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt, a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation.
arXiv Detail & Related papers (2026-02-12T03:10:02Z) - Implicit Reasoning in Large Language Models: A Comprehensive Survey [67.53966514728383]
Large Language Models (LLMs) have demonstrated strong generalization across a wide range of tasks. Recent studies have shifted attention from explicit chain-of-thought prompting toward implicit reasoning. This survey introduces a taxonomy centered on execution paradigms, shifting the focus from representational forms to computational strategies.
arXiv Detail & Related papers (2025-09-02T14:16:02Z) - CMR-SPB: Cross-Modal Multi-Hop Reasoning over Text, Image, and Speech with Path Balance [10.843417240658992]
Cross-modal multi-hop reasoning (CMR) is a valuable yet underexplored capability of multimodal large language models (MLLMs). We argue that existing benchmarks for evaluating this ability have critical shortcomings. We introduce a novel benchmark -- Cross-Modal Multi-Hop Reasoning over Text, Image and Speech with Path Balance (CMR-SPB).
arXiv Detail & Related papers (2025-08-22T08:17:31Z) - PixelThink: Towards Efficient Chain-of-Pixel Reasoning [70.32510083790069]
PixelThink is a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty. It learns to compress reasoning length in accordance with scene complexity and predictive confidence. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance.
arXiv Detail & Related papers (2025-05-29T17:55:49Z) - DialogueReason: Rule-Based RL Sparks Dialogue Reasoning in LLMs [54.4857963044859]
We propose DialogueReason, a reasoning paradigm that uncovers the lost roles in monologue-style reasoning models. Our work consists of an analysis of monologue reasoning patterns and the development of a dialogue-based reasoning approach.
arXiv Detail & Related papers (2025-05-11T16:39:58Z) - Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models [46.26140720993383]
Multi-LogiEval is a comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths.
We conduct evaluations on a range of Large Language Models including GPT-4, ChatGPT, Gemini-Pro, Yi, Orca, and Mistral.
arXiv Detail & Related papers (2024-06-24T23:02:56Z) - InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z) - STREET: A Multi-Task Structured Reasoning and Explanation Benchmark [56.555662318619135]
We introduce a unified multi-task and multi-domain natural language reasoning and explanation benchmark.
We expect models not only to answer questions, but also to produce step-by-step structured explanations describing how the premises in the question are used to derive intermediate conclusions that prove the correctness of a given answer.
arXiv Detail & Related papers (2023-02-13T22:34:02Z) - Is Multi-Hop Reasoning Really Explainable? Towards Benchmarking Reasoning Interpretability [33.220997121043965]
We propose a unified framework to quantitatively evaluate the interpretability of multi-hop reasoning models.
Specifically, we define three metrics for evaluation: path recall, local interpretability, and global interpretability.
Results show that the interpretability of current multi-hop reasoning models is unsatisfactory and still far from the upper bound given by our benchmark.
arXiv Detail & Related papers (2021-04-14T10:12:05Z)