Using Counterfactual Tasks to Evaluate the Generality of Analogical
Reasoning in Large Language Models
- URL: http://arxiv.org/abs/2402.08955v1
- Date: Wed, 14 Feb 2024 05:52:23 GMT
- Title: Using Counterfactual Tasks to Evaluate the Generality of Analogical
Reasoning in Large Language Models
- Authors: Martha Lewis and Melanie Mitchell
- Abstract summary: We investigate the generality of analogy-making abilities previously claimed for large language models (LLMs).
We show that while the performance of humans remains high for all the problems, the GPT models' performance declines sharply on the counterfactual set.
- Score: 7.779982757267302
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) have performed well on several reasoning
benchmarks, including ones that test analogical reasoning abilities. However,
it has been debated whether they are actually performing humanlike abstract
reasoning or instead employing less general processes that rely on similarity
to what has been seen in their training data. Here we investigate the
generality of analogy-making abilities previously claimed for LLMs (Webb,
Holyoak, & Lu, 2023). We take one set of analogy problems used to evaluate LLMs
and create a set of "counterfactual" variants: versions that test the same
abstract reasoning abilities but that are likely dissimilar from any
pre-training data. We test humans and three GPT models on both the original and
counterfactual problems, and show that, while the performance of humans remains
high for all the problems, the GPT models' performance declines sharply on the
counterfactual set. This work provides evidence that, despite previously
reported successes of LLMs on analogical reasoning, these models lack the
robustness and generality of human analogy-making.
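The abstract does not state the format of the problems or how the counterfactual variants were built. As a purely illustrative sketch, assume letter-string analogies and assume that the counterfactual change is a permuted alphabet; neither assumption comes from the text above, and the function names `successor_problem` and `make_pair` are hypothetical. The idea shown is that the abstract rule (replace the last letter with its successor) stays fixed while the surface material changes.

```python
import random
import string


def successor_problem(alphabet, start, length=4):
    """Build one 'source -> target' pair over the given alphabet ordering.

    Assumed rule for illustration: the target equals the source with its
    last letter replaced by that letter's successor in `alphabet`.
    """
    source = list(alphabet[start:start + length])
    target = source[:-1] + [alphabet[start + length]]
    return source, target


def make_pair(seed=0):
    """Return (permuted_alphabet, original_problem, counterfactual_problem)."""
    standard = list(string.ascii_lowercase)
    permuted = standard[:]
    # Hypothetical counterfactual: same letters, unfamiliar ordering.
    random.Random(seed).shuffle(permuted)
    original = successor_problem(standard, start=0)
    counterfactual = successor_problem(permuted, start=0)
    return permuted, original, counterfactual


if __name__ == "__main__":
    permuted, original, counterfactual = make_pair(seed=0)
    print("original:      ", original)
    print("permuted order:", "".join(permuted))
    print("counterfactual:", counterfactual)
```

Solving the second problem requires applying the successor rule to the stated ordering rather than recalling the familiar alphabet, which is the kind of dissimilarity from pre-training data the abstract refers to.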
Related papers
- Evaluating the Robustness of Analogical Reasoning in Large Language Models [6.5855735579366685]
We investigate the robustness of analogy-making abilities previously claimed for LLMs.
We test humans and GPT models on robustness to variants of the original analogy problems.
Unlike humans, the performance of GPT models is susceptible to answer-order effects.
arXiv Detail & Related papers (2024-11-21T15:25:08Z)
- A Comparative Study on Reasoning Patterns of OpenAI's o1 Model [69.08287909042421]
We show that OpenAI's o1 model has achieved the best performance on most datasets.
We also provide a detailed analysis on several reasoning benchmarks.
arXiv Detail & Related papers (2024-10-17T15:09:03Z)
- Evaluating Consistency and Reasoning Capabilities of Large Language Models [0.0]
Large Language Models (LLMs) are extensively used today across various sectors, including academia, research, business, and finance.
Despite their widespread adoption, these models often produce incorrect and misleading information, exhibiting a tendency to hallucinate.
This paper aims to evaluate and compare the consistency and reasoning capabilities of both public and proprietary LLMs.
arXiv Detail & Related papers (2024-04-25T10:03:14Z)
- AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies [19.613777134600408]
Analogical thinking allows humans to solve problems in creative ways.
Can language models (LMs) do the same?
Our benchmarking approach focuses on aspects of this ability that are common among humans.
arXiv Detail & Related papers (2024-02-19T18:56:44Z)
- LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs).
Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models.
We leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
- Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice.
We investigate whether we can go beyond human data on tasks where we have access to scalar feedback.
We find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
- Can language models learn analogical reasoning? Investigating training objectives and comparisons to human performance [0.0]
We test several ways to learn basic analogical reasoning, specifically focusing on analogies that are more typical of what is used to evaluate analogical reasoning in humans.
Our experiments find that models are able to learn analogical reasoning, even with a small amount of data.
arXiv Detail & Related papers (2023-10-09T10:34:38Z)
- Response: Emergent analogical reasoning in large language models [0.034530027457862]
GPT-3 fails to solve the simplest variations of the original tasks, whereas human performance remains consistently high across all modified versions.
To strengthen claims of humanlike reasoning such as zero-shot reasoning, it is important to develop approaches that rule out data memorization.
arXiv Detail & Related papers (2023-08-30T16:17:26Z)
- A comprehensive comparative evaluation and analysis of Distributional Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict-based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.