Evaluating LLMs' Inherent Multi-hop Reasoning Ability
- URL: http://arxiv.org/abs/2402.11924v4
- Date: Fri, 5 Jul 2024 13:43:43 GMT
- Title: Evaluating LLMs' Inherent Multi-hop Reasoning Ability
- Authors: Jian Wu, Linyi Yang, Zhen Wang, Manabu Okumura, Yue Zhang,
- Abstract summary: Multi-step reasoning abilities on multiple evidence integration on Multi-hop QA tasks remain underexplored.
Current Multi-hop QA benchmarks are factual and annotated on open-source corpora such as Wikipedia.
We introduce the Inherent Reasoning Evaluation (IRE) method, a novel evaluation way that jointly evaluates the LLMs' chain-of-reasoning performance.
- Score: 39.64489055580211
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Large Language Models (LLMs) excel in question-answering (QA) tasks, their multi-step reasoning abilities on multiple evidence integration on Multi-hop QA tasks remain underexplored. LLMs sometimes generate answers that rely on internal memory rather than reasoning given context, which brings concerns about the evaluation quality of real reasoning abilities. The counterfactual QA task can separate internal memory from reasoning abilities, but focusing solely on final-QA performance without evaluating the multi-step reasoning process is insufficient for reporting LLMs' real reasoning abilities. Current Multi-hop QA (MHQA) benchmarks are factual and annotated on open-source corpora such as Wikipedia, although useful for multi-step reasoning evaluation, showing limitations due to potential data contamination in LLMs pre-training stage. To address this issue, we introduce the Inherent Reasoning Evaluation (IRE) method, a novel evaluation way that jointly evaluates the LLMs' chain-of-reasoning performance based on the first knowledge-edited counterfactual multi-hop QA data which involves editing the original Wikipedia passages, reducing data contamination risks. The IRE comprehensively assesses reasoning chains through sub-QA and final-QA evaluations. Our comparisons reveal significant performance gaps for several LLMs between Wikipedia-based benchmarks and IRE, deeming data contamination issues in existing benchmarks. We believe that the IRE benchmark will enhance and facilitate trustworthy LLM evaluations.
Related papers
- CounterBench: A Benchmark for Counterfactuals Reasoning in Large Language Models [5.409370027524351]
We evaluate the performance of large language models (LLMs) in counterfactual reasoning.
We introduce a new benchmark dataset, CounterBench, comprising 1K counterfactual reasoning questions.
arXiv Detail & Related papers (2025-02-16T06:19:37Z) - Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning [34.427730009102966]
We develop an automated evaluation framework to identify reasoning errors and evaluate the performance of LLMs.
Our work will also serve as an evaluation framework that can be used in detailed error analysis of reasoning chains for logic-intensive complex tasks.
arXiv Detail & Related papers (2025-02-08T19:49:32Z) - CARL-GT: Evaluating Causal Reasoning Capabilities of Large Language Models [18.975064947089805]
Causal reasoning capabilities are essential for large language models (LLMs) in a wide range of applications, such as education and healthcare.
We provide a benchmark, named by CARL-GT, which evaluates CAusal Reasoning capabilities of large Language models using Graphs and Tabular data.
arXiv Detail & Related papers (2024-12-23T20:34:32Z) - A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models [11.087034068992653]
FAUN-Eval is a benchmark specifically designed to evaluate the Fine-grAined issUe solviNg capabilities of LLMs.
It is constructed using a dataset curated from 30 well-known GitHub repositories.
We evaluate ten LLMs with FAUN-Eval, including four closed-source and six open-source models.
arXiv Detail & Related papers (2024-11-27T03:25:44Z) - Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z) - Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs [52.42505579545893]
Large language models (LLMs) demonstrate strong reasoning abilities when prompted to generate chain-of-thought explanations alongside answers.
We propose a novel discriminative and generative CoT evaluation paradigm to assess LLMs' knowledge of reasoning and the accuracy of the generated CoT.
arXiv Detail & Related papers (2024-02-17T05:22:56Z) - InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal
Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z) - Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning [73.51314109184197]
It is crucial for large language models (LLMs) to understand the concept of temporal knowledge.
We propose a complex temporal question-answering dataset Complex-TR that focuses on multi-answer and multi-hop temporal reasoning.
arXiv Detail & Related papers (2023-11-16T11:49:29Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large
Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.