Is Multi-Hop Reasoning Really Explainable? Towards Benchmarking
Reasoning Interpretability
- URL: http://arxiv.org/abs/2104.06751v1
- Date: Wed, 14 Apr 2021 10:12:05 GMT
- Title: Is Multi-Hop Reasoning Really Explainable? Towards Benchmarking
Reasoning Interpretability
- Authors: Xin Lv, Yixin Cao, Lei Hou, Juanzi Li, Zhiyuan Liu, Yichi Zhang, Zelin
Dai
- Abstract summary: We propose a unified framework to quantitatively evaluate the interpretability of multi-hop reasoning models.
Specifically, we define three evaluation metrics: path recall, local interpretability, and global interpretability.
Results show that the interpretability of current multi-hop reasoning models is unsatisfactory and still far from the upper bound given by our benchmark.
- Score: 33.220997121043965
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Multi-hop reasoning has been widely studied in recent years to obtain more
interpretable link prediction. However, we find in experiments that many paths
given by these models are actually unreasonable, while little work has been
done on evaluating their interpretability. In this paper, we propose a unified
framework to quantitatively evaluate the interpretability of multi-hop
reasoning models so as to advance their development. Specifically, we define
three evaluation metrics: path recall, local interpretability, and global
interpretability, and design an approximate strategy to calculate them using
the interpretability scores of rules. Furthermore, we manually annotate all
possible rules and establish a Benchmark to detect the Interpretability of
Multi-hop Reasoning (BIMR). In experiments, we run nine baselines on our
benchmark. The results show that the interpretability of current multi-hop
reasoning models is unsatisfactory and still far from the upper bound given by
our benchmark. Moreover, rule-based models outperform multi-hop reasoning
models in both performance and interpretability, which points to a direction
for future research: investigating how to better incorporate rule information
into multi-hop reasoning models. Our code and datasets are available at
https://github.com/THU-KEG/BIMR.
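
To make the three metrics concrete, below is a minimal sketch of how such an evaluation might be computed. The abstract names the metrics but does not give their formulas, so the definitions here are plausible reconstructions, not the authors' reference implementation (see the linked repository for that): path recall (PR) as the fraction of test triples for which the model returns at least one reasoning path, local interpretability (LI) as the mean annotated rule score of the returned paths, and global interpretability (GI) as PR * LI. The function name, the rule-score lookup, and the zero default for unannotated paths are all illustrative assumptions.

```python
# Hypothetical sketch of BIMR-style interpretability metrics.
# Assumptions (not from the abstract): a path is a tuple of relations,
# each annotated rule maps a relation path to a score in [0, 1],
# and paths with no matching rule count as uninterpretable (score 0).

from typing import Dict, List, Tuple

RuleScores = Dict[Tuple[str, ...], float]

def evaluate_interpretability(
    predicted_paths: List[Tuple[str, ...]],  # one entry per test triple; () if no path found
    rule_scores: RuleScores,
) -> Dict[str, float]:
    """Compute path recall, local interpretability, and global interpretability."""
    n_total = len(predicted_paths)
    found = [p for p in predicted_paths if p]  # triples where the model found a path
    path_recall = len(found) / n_total if n_total else 0.0
    # Approximate each path's interpretability by the annotated score of its
    # matching rule; paths without an annotated rule default to 0.
    local = sum(rule_scores.get(p, 0.0) for p in found) / len(found) if found else 0.0
    return {
        "path_recall": path_recall,
        "local_interpretability": local,
        "global_interpretability": path_recall * local,
    }

if __name__ == "__main__":
    rules = {("born_in", "city_of"): 1.0, ("spouse_of", "born_in"): 0.3}
    paths = [("born_in", "city_of"), ("spouse_of", "born_in"), ()]
    print(evaluate_interpretability(paths, rules))
    # -> PR = 0.667, LI = 0.65, GI = 0.433 under these toy annotations
```

Under these assumed definitions, GI is bounded above by PR, which matches the abstract's framing of an upper bound that current models fall short of.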
Related papers
- On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks [56.98385132295952]
We evaluate how well chain-of-thought approaches generalize on a simple planning task. We find that reasoning traces which combine multiple text formats yield the best (and non-trivial) OOD generalization, and that purely text-based models consistently outperform those utilizing image-based inputs.
arXiv Detail & Related papers (2026-02-17T09:51:40Z) - RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation [5.080252830507515]
Reasoning Process Tree Score (RPTS) is a tree structure-based metric to assess reasoning processes. To validate RPTS in real-world multimodal scenarios, we construct a new benchmark, RPTS-Eval, comprising 374 images and 390 reasoning instances.
arXiv Detail & Related papers (2025-11-10T09:48:07Z) - CMR-SPB: Cross-Modal Multi-Hop Reasoning over Text, Image, and Speech with Path Balance [10.843417240658992]
Cross-modal multi-hop reasoning (CMR) is a valuable yet underexplored capability of multimodal large language models (MLLMs). We argue that existing benchmarks for evaluating this ability have critical shortcomings. We introduce a novel benchmark: Cross-Modal Multi-Hop Reasoning over Text, Image and Speech with Path Balance (CMR-SPB).
arXiv Detail & Related papers (2025-08-22T08:17:31Z) - A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [53.18562650350898]
Chain-of-thought (CoT) reasoning enhances the performance of large language models. We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z) - Interleaved Reasoning for Large Language Models via Reinforcement Learning [22.403928213802036]
Long chain-of-thought (CoT) enhances the reasoning capabilities of large language models (LLMs). We propose a novel training paradigm that uses reinforcement learning (RL) to guide reasoning LLMs to interleave thinking and answering for multi-hop questions.
arXiv Detail & Related papers (2025-05-26T07:58:17Z) - Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering [24.446222685949227]
Large language models (LLMs) face challenges in knowledge-intensive multi-hop reasoning. We propose Self-Critique Guided Iterative Reasoning (SiGIR), which uses self-critique feedback to guide the iterative reasoning process.
arXiv Detail & Related papers (2025-05-25T12:10:24Z) - Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching [60.04718679054704]
We introduce Sketch-of-Thought (SoT), a novel prompting framework.
It combines cognitive-inspired reasoning paradigms with linguistic constraints to minimize token usage.
SoT achieves token reductions of 76% with negligible accuracy impact.
arXiv Detail & Related papers (2025-03-07T06:57:17Z) - Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.49074603075032]
Recent studies have shown that making a model spend more time thinking through longer Chain of Thoughts (CoTs) enables it to gain significant improvements in complex reasoning tasks.
We explore whether scaling with longer CoTs can indeed impair the reasoning performance of Large Language Models (LLMs) in certain domains.
arXiv Detail & Related papers (2025-02-25T10:48:05Z) - P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains [97.25943550933829]
We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains.
We use P-FOLIO to evaluate and improve large-language-model (LLM) reasoning capabilities.
arXiv Detail & Related papers (2024-10-11T19:22:57Z) - Seemingly Plausible Distractors in Multi-Hop Reasoning: Are Large Language Models Attentive Readers? [6.525065859315515]
We investigate whether Large Language Models (LLMs) are prone to exploiting simplifying cues in multi-hop reasoning benchmarks.
Motivated by this finding, we propose a challenging multi-hop reasoning benchmark by generating seemingly plausible multi-hop reasoning chains.
We find that the models' ability to perform multi-hop reasoning is affected, as indicated by a relative decrease of up to 45% in F1 score when presented with such seemingly plausible alternatives.
arXiv Detail & Related papers (2024-09-08T19:22:58Z) - LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models [52.03659714625452]
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks.
But can they really "reason" over natural language?
This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z) - Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs [87.34281749422756]
Large language models (LLMs) have achieved impressive human-like performance across various reasoning tasks.
However, their mastery of underlying inferential rules still falls short of human capabilities.
We propose a logic scaffolding inferential rule generation framework to construct an inferential rule base, ULogic.
arXiv Detail & Related papers (2024-02-18T03:38:51Z) - Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation [110.71955853831707]
We view LMs as deriving new conclusions by aggregating indirect reasoning paths seen at pre-training time.
We formalize the reasoning paths as random walk paths on the knowledge/reasoning graphs.
Experiments and analysis on multiple KG and CoT datasets reveal the effect of training on random walk paths.
arXiv Detail & Related papers (2024-02-05T18:25:51Z) - Did the Models Understand Documents? Benchmarking Models for Language
Understanding in Document-Level Relation Extraction [2.4665182280122577]
Document-level relation extraction (DocRE) has attracted increasing research interest in recent years.
While models achieve consistent performance gains in DocRE, their underlying decision rules are still understudied.
In this paper, we take a first step toward answering this question and introduce a new perspective for comprehensively evaluating a model.
arXiv Detail & Related papers (2023-06-20T08:52:05Z) - HOP, UNION, GENERATE: Explainable Multi-hop Reasoning without Rationale
Supervision [118.0818807474809]
This work proposes a principled, probabilistic approach for training explainable multi-hop QA systems without rationale supervision.
Our approach performs multi-hop reasoning by explicitly modeling rationales as sets, enabling the model to capture interactions between documents and sentences within a document.
arXiv Detail & Related papers (2023-05-23T16:53:49Z) - STREET: A Multi-Task Structured Reasoning and Explanation Benchmark [56.555662318619135]
We introduce a unified multi-task and multi-domain natural language reasoning and explanation benchmark.
We expect models to not only answer questions, but also produce step-by-step structured explanations describing how premises in the question are used to produce intermediate conclusions that can prove the correctness of a certain answer.
arXiv Detail & Related papers (2023-02-13T22:34:02Z) - Reasoning Circuits: Few-shot Multihop Question Generation with
Structured Rationales [11.068901022944015]
Chain-of-thought rationale generation has been shown to improve performance on multi-step reasoning tasks.
We introduce a new framework for applying chain-of-thought inspired structured rationale generation to multi-hop question generation under a very low supervision regime.
arXiv Detail & Related papers (2022-11-15T19:36:06Z) - MPLR: a novel model for multi-target learning of logical rules for
knowledge graph reasoning [5.499688003232003]
We study the problem of learning logic rules for reasoning on knowledge graphs for completing missing factual triplets.
We propose a model called MPLR that improves on existing models by fully using the training data and considering multi-target scenarios.
Experimental results empirically demonstrate that our MPLR model outperforms state-of-the-art methods on five benchmark datasets.
arXiv Detail & Related papers (2021-12-12T09:16:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.