Benchmarking Systematic Relational Reasoning with Large Language and Reasoning Models
- URL: http://arxiv.org/abs/2503.23487v1
- Date: Sun, 30 Mar 2025 15:41:55 GMT
- Title: Benchmarking Systematic Relational Reasoning with Large Language and Reasoning Models
- Authors: Irtaza Khalid, Amir Masoud Nourollah, Steven Schockaert
- Abstract summary: Large Language Models (LLMs) have been found to struggle with systematic reasoning. This paper focuses on tasks that require systematic reasoning about relational compositions. We find that the considered LLMs and LRMs perform poorly overall, albeit better than random chance.
- Score: 15.56445409535547
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have been found to struggle with systematic reasoning. Even on tasks where they appear to perform well, their performance often depends on shortcuts rather than on genuine reasoning abilities, leading them to collapse on out-of-distribution examples. Post-training strategies based on reinforcement learning and chain-of-thought prompting have recently been hailed as a step change. However, little is still known about the potential of the resulting "Large Reasoning Models" (LRMs) beyond problem solving in mathematics and programming, where finding genuine out-of-distribution problems can be difficult. In this paper, we focus on tasks that require systematic reasoning about relational compositions, especially for qualitative spatial and temporal reasoning. These tasks allow us to control the difficulty of problem instances and to measure precisely to what extent models can generalise. We find that the considered LLMs and LRMs perform poorly overall, albeit better than random chance.
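To make the benchmark idea concrete, below is a minimal, hypothetical sketch (not the paper's actual task generator or relation algebra) of how relational-composition instances with a controllable difficulty parameter k might be constructed for qualitative spatial reasoning: each instance chains k pairwise relations between named entities, and the gold answer is the composed relation between the first and last entity, computed here by summing integer offsets (a simplification of true qualitative composition).

```python
# Hypothetical illustration only: chained spatial relations whose length k
# controls difficulty; the gold answer is the composed relation E0 -> Ek.
import random

# Each relation "a is REL b" is modelled as an offset of a relative to b.
STEPS = {
    "left of":  (-1, 0),
    "right of": (1, 0),
    "above":    (0, 1),
    "below":    (0, -1),
}

def make_instance(k, seed=None):
    """Build a chain of k relational facts plus the composed gold answer."""
    rng = random.Random(seed)
    names = [f"E{i}" for i in range(k + 1)]
    facts, dx, dy = [], 0, 0
    for a, b in zip(names, names[1:]):
        rel, (sx, sy) = rng.choice(list(STEPS.items()))
        facts.append(f"{a} is {rel} {b}.")
        dx, dy = dx + sx, dy + sy          # accumulate E0's offset from Ek
    horiz = "left of" if dx < 0 else "right of" if dx > 0 else ""
    vert = "above" if dy > 0 else "below" if dy < 0 else ""
    answer = " and ".join(r for r in (horiz, vert) if r) or "at the same place as"
    question = f"Where is {names[0]} relative to {names[-1]}?"
    return " ".join(facts), question, answer

if __name__ == "__main__":
    context, question, gold = make_instance(k=5, seed=0)
    print(context)
    print(question, "Gold answer:", gold)
```

Because k directly sets the number of compositions a model must chain together, instances with larger k than any seen during training or prompting can serve as out-of-distribution probes of systematic generalisation.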
Related papers
- Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1) [66.51642638034822]
Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks.
Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic domains.
This paper offers a concise yet insightful overview of reasoning techniques in both textual and multimodal LLMs.
arXiv Detail & Related papers (2025-04-04T04:04:56Z) - MetaLadder: Ascending Mathematical Solution Quality via Analogical-Problem Reasoning Transfer [37.81465564673498]
Large Language Models (LLMs) have demonstrated promising capabilities in solving mathematical reasoning tasks.
We propose MetaLadder, a framework that explicitly prompts LLMs to recall and reflect on meta-problems.
Our experiments on mathematical benchmarks demonstrate that our MetaLadder significantly boosts LLMs' problem-solving accuracy.
arXiv Detail & Related papers (2025-03-19T04:36:35Z) - Navigating Semantic Relations: Challenges for Language Models in Abstract Common-Sense Reasoning [5.4141465747474475]
Large language models (LLMs) have achieved remarkable performance in generating human-like text and solving problems of moderate complexity.
We systematically evaluate abstract common-sense reasoning in LLMs using the ConceptNet knowledge graph.
arXiv Detail & Related papers (2025-02-19T20:20:24Z) - Large Language Models and Mathematical Reasoning Failures [1.6114012813668932]
This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. We rigorously analyze both final answers and solution steps to identify reasoning failures. We find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic.
arXiv Detail & Related papers (2025-02-17T09:07:32Z) - Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z) - Self-supervised Analogical Learning using Language Models [59.64260218737556]
We propose SAL, a self-supervised analogical learning framework.
SAL mimics the human analogy process and trains models to explicitly transfer high-quality symbolic solutions.
We show that the resulting models outperform base language models on a wide range of reasoning benchmarks.
arXiv Detail & Related papers (2025-02-03T02:31:26Z) - Make LLMs better zero-shot reasoners: Structure-orientated autonomous reasoning [52.83539473110143]
We introduce a novel structure-oriented analysis method to help Large Language Models (LLMs) better understand a question.
To further improve the reliability in complex question-answering tasks, we propose a multi-agent reasoning system, Structure-oriented Autonomous Reasoning Agents (SARA).
Extensive experiments verify the effectiveness of the proposed reasoning system. Surprisingly, in some cases, the system even surpasses few-shot methods.
arXiv Detail & Related papers (2024-10-18T05:30:33Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present a process-based benchmark, MR-Ben, that demands a meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Can LLMs Reason in the Wild with Programs? [20.47557047823847]
We introduce the task of reasoning in the wild, where an LLM is tasked to solve a reasoning problem of unknown type.
We create a large tactic-guided trajectory dataset containing detailed solutions to a diverse set of reasoning problems.
In experiments, we highlight that existing LLMs fail significantly on problems with ambiguous and mixed scope.
arXiv Detail & Related papers (2024-06-19T18:26:19Z) - MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions [47.83142414018448]
We focus on two popular reasoning tasks: arithmetic reasoning and code generation.
We introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets.
We show a significant performance drop across all the models against perturbed questions.
arXiv Detail & Related papers (2024-01-17T18:13:07Z) - Concise and Organized Perception Facilitates Reasoning in Large Language Models [31.238220405009617]
Exploiting large language models (LLMs) to tackle reasoning has garnered growing attention. It remains highly challenging to achieve satisfactory results on complex logical problems, which are characterized by many premises within the context and require multi-hop reasoning. In this work, we first examine the mechanism from the perspective of information flow and reveal that LLMs confront difficulties akin to human-like cognitive biases when dealing with disordered and irrelevant content in reasoning tasks.
arXiv Detail & Related papers (2023-10-05T04:47:49Z)