Related papers: Premise Order Matters in Reasoning with Large Language Models

Premise Order Matters in Reasoning with Large Language Models

URL: http://arxiv.org/abs/2402.08939v3
Date: Tue, 28 May 2024 04:32:09 GMT
Title: Premise Order Matters in Reasoning with Large Language Models
Authors: Xinyun Chen, Ryan A. Chi, Xuezhi Wang, Denny Zhou,
Abstract summary: We show that large language models (LLMs) are surprisingly brittle to the ordering of the premises. We observe that LLMs achieve the best performance when the premise order aligns with the context required in intermediate reasoning steps.
Score: 57.18850969634412
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have accomplished remarkable reasoning performance in various domains. However, in the domain of reasoning tasks, we discover a frailty: LLMs are surprisingly brittle to the ordering of the premises, despite the fact that such ordering does not alter the underlying task. In particular, we observe that LLMs achieve the best performance when the premise order aligns with the context required in intermediate reasoning steps. For example, in deductive reasoning tasks, presenting the premises in the same order as the ground truth proof in the prompt (as opposed to random ordering) drastically increases the model's accuracy. We first examine the effect of premise ordering on deductive reasoning on a variety of LLMs, and our evaluation shows that permuting the premise order can cause a performance drop of over 30%. In addition, we release the benchmark R-GSM, based on GSM8K, to examine the ordering effect for mathematical problem-solving, and we again observe a significant drop in accuracy, relative to the original GSM8K benchmark.

Related papers

CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection [60.98964268961243]
We propose that guiding models to perform a systematic and comprehensive reasoning process allows models to execute much finer-grained and accurate entailment decisions.<n>We define a 3-step reasoning process, consisting of (i) claim decomposition, (ii) sub-claim attribution and entailment classification, and (iii) aggregated classification, showing that such guided reasoning indeed yields improved hallucination detection.
arXiv Detail & Related papers (2025-06-05T17:02:52Z)
Order Doesn't Matter, But Reasoning Does: Training LLMs with Order-Centric Augmentation [37.49633143660649]
We introduce an order-centric data augmentation framework based on commutativity in logical reasoning. By leveraging order-centric augmentations, models can develop a more flexible and generalized reasoning process.
arXiv Detail & Related papers (2025-02-27T09:25:50Z)
Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs [10.373838332986738]
Chain-of-Thought (CoT) prompting enhances mathematical reasoning in large language models (LLMs) We present a framework that identifies the premises for each step, to improve the evaluation of reasoning. Our findings highlight the utility of premise-centric representations in addressing complex problem-solving tasks.
arXiv Detail & Related papers (2025-02-04T14:44:58Z)
Prompting Strategies for Enabling Large Language Models to Infer Causation from Correlation [68.58373854950294]
We focus on causal reasoning and address the task of establishing causal relationships based on correlation information. We introduce a prompting strategy for this problem that breaks the original task into fixed subquestions. We evaluate our approach on an existing causal benchmark, Corr2Cause.
arXiv Detail & Related papers (2024-12-18T15:32:27Z)
Not All LLM Reasoners Are Created Equal [58.236453890457476]
We study the depth of grade-school math problem-solving capabilities of LLMs. We evaluate their performance on pairs of existing math word problems together.
arXiv Detail & Related papers (2024-10-02T17:01:10Z)
Order Matters in Hallucination: Reasoning Order as Benchmark and Reflexive Prompting for Large-Language-Models [0.0]
Large language models (LLMs) have generated significant attention since their inception, finding applications across various academic and industrial domains. LLMs often suffer from the "hallucination problem", where outputs, though grammatically and logically coherent, lack factual accuracy or are entirely fabricated.
arXiv Detail & Related papers (2024-08-09T14:34:32Z)
Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models [84.15513004135576]
Current research enhances the reasoning performance of Large Language Models (LLMs) by sampling multiple reasoning chains and ensembling based on the answer frequency. This approach fails in scenarios where the correct answers are in the minority. We introduce a hierarchical reasoning aggregation framework AoR, which selects answers based on the evaluation of reasoning chains.
arXiv Detail & Related papers (2024-05-21T17:12:19Z)
Assessing the Reasoning Capabilities of LLMs in the context of Evidence-based Claim Verification [22.92500697622486]
We propose a framework designed to break down any claim paired with evidence into atomic reasoning types. We use this framework to create Reasoning in Evidence-based Claim Verification (RECV), the first claim verification benchmark. We evaluate three state-of-the-art proprietary LLMs under multiple prompt settings.
arXiv Detail & Related papers (2024-02-16T14:52:05Z)
On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks [17.329365493094542]
We present a principled empirical study of the performance of GPT-4 in three domains: Game of 24, Graph Coloring, and STRIPS planning. We observe significant performance collapse with self-critique and significant performance gains with sound external verification.
arXiv Detail & Related papers (2024-02-12T23:11:01Z)
LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs) Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models. We leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning. We evaluate a selection of representative MLLMs using this rigorously developed open-ended multi-step elaborate reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z)
Hypothesis Search: Inductive Reasoning with Language Models [39.03846394586811]
Recent work evaluates large language models on inductive reasoning tasks by directly prompting them yielding "in context learning" This works well for straightforward inductive tasks but performs poorly on complex tasks such as the Abstraction and Reasoning Corpus (ARC) In this work, we propose to improve the inductive reasoning ability of LLMs by generating explicit hypotheses at multiple levels of abstraction.
arXiv Detail & Related papers (2023-09-11T17:56:57Z)
Large Language Models are Better Reasoners with Self-Verification [48.534270563880845]
Large language models (LLMs) have shown strong reasoning ability in several natural language processing tasks. LLMs with chain of thought (CoT) prompting require multi-step prompting and multi-token prediction, which is highly sensitive to individual mistakes. We propose and prove that LLMs also have similar self-verification abilities.
arXiv Detail & Related papers (2022-12-19T15:51:52Z)
Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning [14.663216851932646]
We show that language models tend to perform fairly well at single step inference tasks, but struggle to chain together multiple reasoning steps to solve more complex problems. We propose a Selection-Inference (SI) framework that exploits pre-trained LLMs as general processing modules. We show that a 7B parameter LLM used within the SI framework in a 5-shot generalisation setting, with no fine-tuning, yields a performance improvement of over 100%.
arXiv Detail & Related papers (2022-05-19T17:25:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.