Related papers: Can Large Language Models put 2 and 2 together? Probing for Entailed Arithmetical Relationships

URL: http://arxiv.org/abs/2404.19432v1
Date: Tue, 30 Apr 2024 10:28:04 GMT
Title: Can Large Language Models put 2 and 2 together? Probing for Entailed Arithmetical Relationships
Authors: D. Panas, S. Seth, V. Belle
Abstract summary: We show that Large Language Models make steady progress in knowledge acquisition and (pseudo)reasoning with each new GPT release. It is difficult to argue that pure statistical learning can cope with the combinatorial explosion inherent in many commonsense reasoning tasks. We argue that bigger is not always better and chasing purely statistical improvements is flawed at the core.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Two major areas of interest in the era of Large Language Models regard questions of what do LLMs know, and if and how they may be able to reason (or rather, approximately reason). Since to date these lines of work progressed largely in parallel (with notable exceptions), we are interested in investigating the intersection: probing for reasoning about the implicitly-held knowledge. Suspecting the performance to be lacking in this area, we use a very simple set-up of comparisons between cardinalities associated with elements of various subjects (e.g. the number of legs a bird has versus the number of wheels on a tricycle). We empirically demonstrate that although LLMs make steady progress in knowledge acquisition and (pseudo)reasoning with each new GPT release, their capabilities are limited to statistical inference only. It is difficult to argue that pure statistical learning can cope with the combinatorial explosion inherent in many commonsense reasoning tasks, especially once arithmetical notions are involved. Further, we argue that bigger is not always better and chasing purely statistical improvements is flawed at the core, since it only exacerbates the dangerous conflation of the production of correct answers with genuine reasoning ability.
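The comparison set-up described in the abstract can be illustrated with a minimal Python sketch. Only the bird/tricycle pair comes from the abstract; the extra spider entry, the question wording, the answer labels, and the helper names `relation` and `build_probes` are assumptions made for illustration and are not the authors' actual prompts or code.

```python
from itertools import combinations

# Hypothetical ground-truth cardinalities. Only the bird/tricycle pair
# appears in the abstract; the spider entry is an assumed extra item.
CARDINALITIES = {
    ("bird", "legs"): 2,
    ("tricycle", "wheels"): 3,
    ("spider", "legs"): 8,
}


def relation(a: int, b: int) -> str:
    """Return the expected answer label for comparing two counts."""
    if a < b:
        return "fewer"
    if a > b:
        return "more"
    return "same"


def build_probes(cards: dict) -> list:
    """Pair every two (subject, part) entries into a question and its expected label."""
    probes = []
    for ((s1, p1), n1), ((s2, p2), n2) in combinations(cards.items(), 2):
        # The question template is an assumption, not the paper's wording.
        question = (
            f"Does a {s1} have more, fewer, or the same number of {p1} "
            f"as a {s2} has {p2}?"
        )
        probes.append((question, relation(n1, n2)))
    return probes


if __name__ == "__main__":
    for question, expected in build_probes(CARDINALITIES):
        print(f"{question} -> {expected}")
```

Answering such a probe requires recalling two separately held facts and then comparing them, which is the intersection of knowledge and reasoning that the abstract targets. The number of pairs grows quadratically with the number of facts, a small-scale instance of the combinatorial growth the abstract mentions.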

Related papers

Frontier LLMs Still Struggle with Simple Reasoning Tasks [53.497499123166804]
This work studies the performance of frontier language models on a broad set of "easy" reasoning problems. We create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning. We show that even state-of-the-art thinking models consistently fail on such problems and for similar reasons.
arXiv Detail & Related papers (2025-07-09T22:22:49Z)
Com$^2$: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models [40.47361817762135]
Large language models (LLMs) have mastered abundant simple and explicit commonsense knowledge through pre-training. LLMs struggle to reason with complex and implicit commonsense knowledge that is derived from simple ones. We propose a benchmark Com$^2$ focusing on complex commonsense reasoning.
arXiv Detail & Related papers (2025-06-08T09:53:08Z)
Have Large Language Models Learned to Reason? A Characterization via 3-SAT Phase Transition [11.422434149376478]
Large Language Models (LLMs) have been touted as AI models possessing advanced reasoning abilities. In theory, autoregressive LLMs with Chain-of-Thought (CoT) can perform more serial computations to solve complex reasoning tasks. Recent studies suggest that, despite this capacity, LLMs do not truly learn to reason but instead fit on statistical features.
arXiv Detail & Related papers (2025-04-04T20:57:36Z)
Failure Modes of LLMs for Causal Reasoning on Narratives [51.19592551510628]
We investigate the causal reasoning abilities of large language models (LLMs) through the representative problem of inferring causal relationships from narratives. We find that even state-of-the-art language models rely on unreliable shortcuts, both in terms of the narrative presentation and their parametric knowledge.
arXiv Detail & Related papers (2024-10-31T12:48:58Z)
Mixture of Parrots: Experts improve memorization more than reasoning [72.445819694797]
We show that as we increase the number of experts, the memorization performance consistently increases while the reasoning capabilities saturate. We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.
arXiv Detail & Related papers (2024-10-24T17:54:41Z)
Not All LLM Reasoners Are Created Equal [58.236453890457476]
We study the depth of grade-school math problem-solving capabilities of LLMs. We evaluate their performance on pairs of existing math word problems posed together.
arXiv Detail & Related papers (2024-10-02T17:01:10Z)
Assessing the Emergent Symbolic Reasoning Abilities of Llama Large Language Models [47.129504708849446]
Large Language Models (LLMs) achieve impressive performance in a wide range of tasks. LLMs show emergent abilities in mathematical reasoning benchmarks. We evaluate three models of the Llama 2 family on different symbolic reasoning tasks.
arXiv Detail & Related papers (2024-06-05T12:22:43Z)
LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models [52.03659714625452]
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really "reason" over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z)
The Impact of Reasoning Step Length on Large Language Models [40.546685248243534]
Chain of Thought (CoT) is significant in improving the reasoning abilities of large language models. We investigate the correlation between the effectiveness of CoT and the length of reasoning steps in prompts.
arXiv Detail & Related papers (2024-01-10T04:37:38Z)
CLadder: Assessing Causal Reasoning in Language Models [82.8719238178569]
We investigate whether large language models (LLMs) can coherently reason about causality. We propose a new NLP task, causal inference in natural language, inspired by the "causal inference engine" postulated by Judea Pearl et al.
arXiv Detail & Related papers (2023-12-07T15:12:12Z)
Can Large Language Models Infer Causation from Correlation? [104.96351414570239]
We test the pure causal inference skills of large language models (LLMs). We formulate a novel task, Corr2Cause, which takes a set of correlational statements and determines the causal relationship between the variables. We show that these models achieve close to random performance on the task.
arXiv Detail & Related papers (2023-06-09T12:09:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.