Using Counterfactual Tasks to Evaluate the Generality of Analogical
Reasoning in Large Language Models
- URL: http://arxiv.org/abs/2402.08955v1
- Date: Wed, 14 Feb 2024 05:52:23 GMT
- Title: Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models
- Authors: Martha Lewis and Melanie Mitchell
- Abstract summary: We investigate the generality of analogy-making abilities previously claimed for large language models (LLMs).
We show that while the performance of humans remains high for all the problems, the GPT models' performance declines sharply on the counterfactual set.
- Score: 7.779982757267302
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) have performed well on several reasoning
benchmarks, including ones that test analogical reasoning abilities. However,
it has been debated whether they are actually performing humanlike abstract
reasoning or instead employing less general processes that rely on similarity
to what has been seen in their training data. Here we investigate the
generality of analogy-making abilities previously claimed for LLMs (Webb,
Holyoak, & Lu, 2023). We take one set of analogy problems used to evaluate LLMs
and create a set of "counterfactual" variants: versions that test the same
abstract reasoning abilities but that are likely dissimilar from any
pre-training data. We test humans and three GPT models on both the original and
counterfactual problems, and show that, while the performance of humans remains
high for all the problems, the GPT models' performance declines sharply on the
counterfactual set. This work provides evidence that, despite previously
reported successes of LLMs on analogical reasoning, these models lack the
robustness and generality of human analogy-making.
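To make the manipulation concrete, the following is a minimal illustrative sketch (not the authors' code) of how a counterfactual variant of a letter-string analogy problem could be generated: the same abstract rule (replace the last letter with its successor) is applied over a permuted alphabet that is unlikely to resemble pre-training data. The problem format, rule, and helper names below are assumptions for illustration only.

```python
# Illustrative sketch (not the authors' code): build an "original" letter-string
# analogy over the standard alphabet and a "counterfactual" variant over a
# permuted alphabet. Both instantiate the same abstract rule.
import random

STANDARD = list("abcdefghijklmnopqrstuvwxyz")

def successor(letter, alphabet):
    """Return the letter that follows `letter` in the given alphabet ordering."""
    return alphabet[alphabet.index(letter) + 1]

def make_problem(alphabet, start=0, length=3):
    """Build one analogy: an example pair (source -> target) plus a probe to complete."""
    src = alphabet[start:start + length]
    tgt = src[:-1] + [successor(src[-1], alphabet)]          # rule: last letter -> its successor
    probe = alphabet[start + 5:start + 5 + length]
    answer = probe[:-1] + [successor(probe[-1], alphabet)]    # apply the same rule to the probe
    return {"example": ("".join(src), "".join(tgt)),
            "probe": "".join(probe),
            "answer": "".join(answer)}

# Original problem over the standard alphabet (likely well represented in training data).
original = make_problem(STANDARD)

# Counterfactual variant: the same abstract rule, but over a fixed random permutation
# of the alphabet, which is unlikely to appear in pre-training corpora.
rng = random.Random(0)
permuted = STANDARD[:]
rng.shuffle(permuted)
counterfactual = make_problem(permuted)

print("original:      ", original)
print("counterfactual:", counterfactual)
```

Under this setup, a solver that has abstracted the rule should handle both variants equally well, whereas a solver relying on familiarity with the standard alphabet should degrade on the permuted one; the sharp drop on the counterfactual set is the pattern the paper reports for the GPT models.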
Related papers
- Self-supervised Analogical Learning using Language Models [59.64260218737556]
We propose SAL, a self-supervised analogical learning framework.
SAL mimics the human analogy process and trains models to explicitly transfer high-quality symbolic solutions.
We show that the resulting models outperform base language models on a wide range of reasoning benchmarks.
arXiv Detail & Related papers (2025-02-03T02:31:26Z)
- JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models [51.99046112135311]
We introduce JustLogic, a synthetically generated deductive reasoning benchmark for rigorous evaluation of Large Language Models.
JustLogic is highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures.
Our experimental results reveal that most state-of-the-art (SOTA) LLMs perform significantly worse than the human average.
arXiv Detail & Related papers (2025-01-24T15:49:10Z)
- Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying [0.3659498819753633]
State-of-the-art Large Language models (LLMs) continue to struggle when performing logical and mathematical reasoning.
This paper makes use of the notion of critical questions from the literature on argumentation theory, focusing in particular on Toulmin's model of argumentation.
We show that employing these critical questions can improve the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-19T18:51:30Z)
- Evaluating the Robustness of Analogical Reasoning in Large Language Models [6.5855735579366685]
We investigate the robustness of analogy-making abilities previously claimed for LLMs.
We test the robustness of humans and GPT models to variants of the original analogy problems.
Unlike humans, GPT models are susceptible to answer-order effects.
arXiv Detail & Related papers (2024-11-21T15:25:08Z)
- Evaluating Consistency and Reasoning Capabilities of Large Language Models [0.0]
Large Language Models (LLMs) are extensively used today across various sectors, including academia, research, business, and finance.
Despite their widespread adoption, these models often produce incorrect and misleading information, exhibiting a tendency to hallucinate.
This paper aims to evaluate and compare the consistency and reasoning capabilities of both public and proprietary LLMs.
arXiv Detail & Related papers (2024-04-25T10:03:14Z)
- LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs).
Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models.
We leverage these findings to construct targeted demonstration examples and fine-tuning data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
- Response: Emergent analogical reasoning in large language models [0.034530027457862]
GPT-3 fails to solve the simplest variations of the original tasks, whereas human performance remains consistently high across all modified versions.
To strengthen claims of humanlike reasoning such as zero-shot reasoning, it is important to develop approaches that rule out data memorization.
arXiv Detail & Related papers (2023-08-30T16:17:26Z)
- A comprehensive comparative evaluation and analysis of Distributional Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict-based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z)