Related papers: Reasoning about Affordances: Causal and Compositional Reasoning in LLMs

URL: http://arxiv.org/abs/2502.16606v1
Date: Sun, 23 Feb 2025 15:21:47 GMT
Title: Reasoning about Affordances: Causal and Compositional Reasoning in LLMs
Authors: Magnus F. Gjerde, Vanessa Cheung, David Lagnado
Abstract summary: We investigate the causal and compositional reasoning abilities of Large Language Models (LLMs) and humans in the domain of object affordances. In Experiment 1, we evaluated GPT-3.5 and GPT-4o, finding that GPT-4o performed on par with human participants, while GPT-3.5 lagged significantly. In Experiment 2, we introduced two new conditions, Distractor and Image, and evaluated Claude 3 Sonnet and Claude 3.5 Sonnet in addition to the GPT models. The Distractor condition significantly impaired performance across humans and models, although GPT-4o and Claude 3.5 still performed well above chance.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the rapid progress of Large Language Models (LLMs), it becomes increasingly important to understand their abilities and limitations. In two experiments, we investigate the causal and compositional reasoning abilities of LLMs and humans in the domain of object affordances, an area traditionally linked to embodied cognition. The tasks, designed from scratch to avoid data contamination, require decision-makers to select unconventional objects to replace a typical tool for a particular purpose, such as using a table tennis racket to dig a hole. In Experiment 1, we evaluated GPT-3.5 and GPT-4o, finding that GPT-4o, when given chain-of-thought prompting, performed on par with human participants, while GPT-3.5 lagged significantly. In Experiment 2, we introduced two new conditions, Distractor (more object choices, increasing difficulty) and Image (object options presented visually), and evaluated Claude 3 Sonnet and Claude 3.5 Sonnet in addition to the GPT models. The Distractor condition significantly impaired performance across humans and models, although GPT-4o and Claude 3.5 still performed well above chance. Surprisingly, the Image condition had little impact on humans or GPT-4o, but significantly lowered Claude 3.5's accuracy. Qualitative analysis showed that GPT-4o and Claude 3.5 have a stronger ability than their predecessors to identify and flexibly apply causally relevant object properties. The improvement from GPT-3.5 and Claude 3 to GPT-4o and Claude 3.5 suggests that models are increasingly capable of causal and compositional reasoning in some domains, although further mechanistic research is necessary to understand how LLMs reason.

Related papers

Inverse Scaling in Test-Time Compute [51.16323216811257]
Extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance. We identify five distinct failure modes when models reason for longer. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns.
arXiv Detail & Related papers (2025-07-19T00:06:13Z)
PhyX: Does Your Model Have the "Wits" for Physical Reasoning? [49.083544963243206]
Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning. We introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios.
arXiv Detail & Related papers (2025-05-21T18:33:50Z)
Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math [135.1260782461186]
Chain-of-Thought (CoT) significantly enhances formal reasoning capabilities in Large Language Models (LLMs). However, improving reasoning in Small Language Models (SLMs) remains challenging due to their limited model capacity. We present a systematic training recipe for SLMs that consists of four steps: (1) large-scale mid-training on diverse distilled long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3) Rollout DPO leveraging a carefully curated preference dataset, and (4) Reinforcement Learning (RL) with Verifiable Reward.
arXiv Detail & Related papers (2025-04-30T00:04:35Z)
Do Large Language Models Reason Causally Like Us? Even Better? [7.749713014052951]
Large language models (LLMs) have shown impressive capabilities in generating human-like text. We compare causal reasoning in humans and four LLMs using tasks based on collider graphs. We find that LLMs reason causally along a spectrum from human-like to normative inference, with alignment shifting based on model, context, and task.
arXiv Detail & Related papers (2025-02-14T15:09:15Z)
Are UFOs Driving Innovation? The Illusion of Causality in Large Language Models [0.0]
This research investigates whether large language models develop the illusion of causality in real-world settings. We evaluated and compared news headlines generated by GPT-4o-Mini, Claude-3.5-Sonnet, and Gemini-1.5-Pro. We found that Claude-3.5-Sonnet is the model that presents the lowest degree of causal illusion aligned with experiments on Correlation-to-Causation Exaggeration.
arXiv Detail & Related papers (2024-10-15T15:20:49Z)
In-Context Learning for Long-Context Sentiment Analysis on Infrastructure Project Opinions [2.974480694911691]
This study evaluates the performance of three leading large language models: GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Our results indicate that GPT-4o excels in zero-shot scenarios for simpler, shorter documents, while Claude 3.5 Sonnet surpasses GPT-4o in handling more complex, sentiment-fluctuating opinions.
arXiv Detail & Related papers (2024-10-15T04:42:21Z)
Using GPT-4 to guide causal machine learning [5.953513005270839]
We focus on the well-established GPT-4 (Turbo) and evaluate its performance under the most restrictive conditions. We show that questionnaire participants judge the GPT-4 graphs as the most accurate in the evaluated categories. We show that pairing GPT-4 with causal ML overcomes this limitation, resulting in graphical structures learnt from real data that align more closely with those identified by domain experts.
arXiv Detail & Related papers (2024-07-26T08:59:26Z)
Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games [56.70628673595041]
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored. This work investigates the performance and merits of LLMs in canonical game-theoretic two-player non-zero-sum games, Stag Hunt and Prisoner's Dilemma. Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models, when making decisions in these games, are affected by at least one of the following systematic biases.
arXiv Detail & Related papers (2024-07-05T12:30:02Z)
Unveiling Divergent Inductive Biases of LLMs on Temporal Data [4.561800294155325]
This research focuses on evaluating the performance of GPT-3.5 and GPT-4 models in the analysis of temporal data. Biases toward specific temporal relationships come to light, with GPT-3.5 demonstrating a preference for "AFTER" in the QA format for both implicit and explicit events, while GPT-4 leans towards "BEFORE".
arXiv Detail & Related papers (2024-04-01T19:56:41Z)
Can Large Language Models do Analytical Reasoning? [45.69642663863077]
This paper explores the cutting-edge Large Language Model with analytical reasoning on sports. We find that GPT-4 stands out in effectiveness, followed by Claude-2.1, with GPT-3.5, Gemini-Pro, and Llama-2-70b lagging behind. To our surprise, we observe that most models, including GPT-4, struggle to accurately count the total scores for NBA quarters despite showing strong performance in counting NFL quarter scores.
arXiv Detail & Related papers (2024-03-06T20:22:08Z)
Distortions in Judged Spatial Relations in Large Language Models [45.875801135769585]
GPT-4 exhibited superior performance with 55 percent accuracy, followed by GPT-3.5 at 47 percent, and Llama-2 at 45 percent. The models identified the nearest cardinal direction in most cases, reflecting their associative learning mechanism.
arXiv Detail & Related papers (2024-01-08T20:08:04Z)
Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias [57.42417061979399]
Recent studies show that instruction tuning (IT) and reinforcement learning from human feedback (RLHF) improve the abilities of large language models (LMs) dramatically. In this work, we investigate the effect of IT and RLHF on decision making and reasoning in LMs. Our findings highlight the presence of these biases in various models from the GPT-3, Mistral, and T5 families.
arXiv Detail & Related papers (2023-08-01T01:39:25Z)
Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality. We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
Spatio-Temporal Graph Contrastive Learning [49.132528449909316]
We propose a Spatio-Temporal Graph Contrastive Learning framework (STGCL) to tackle these issues. We elaborate on four types of data augmentations which disturb data in terms of graph structure, time domain, and frequency domain. Our framework is evaluated across three real-world datasets and four state-of-the-art models.
arXiv Detail & Related papers (2021-08-26T16:05:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.