ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
- URL: http://arxiv.org/abs/2502.01100v1
- Date: Mon, 03 Feb 2025 06:44:49 GMT
- Title: ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
- Authors: Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, Yejin Choi,
- Abstract summary: We introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance.
ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity.
Our results reveal a significant decline in accuracy as problem complexity grows.
- Score: 92.76959707441954
- License:
- Abstract: We investigate the logical reasoning capabilities of large language models (LLMs) and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity, facilitating a systematic study of the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By encompassing a broad range of search space complexities and diverse logical constraints, ZebraLogic provides a structured environment to evaluate reasoning under increasing difficulty. Our results reveal a significant decline in accuracy as problem complexity grows -- a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.
Related papers
- Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment [21.12989936864145]
Chain-of-Thought (CoT) prompting has shown promise in enhancing the reasoning capabilities of large language models (LLMs)
We propose Reasoning-as-Logic-Units (RaLU), which constructs a more reliable reasoning path by aligning logical units between the generated program and their corresponding NL descriptions.
arXiv Detail & Related papers (2025-02-05T08:23:18Z) - JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models [51.99046112135311]
We introduce JustLogic, a synthetically generated deductive reasoning benchmark for rigorous evaluation of Large Language Models.
JustLogic is highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures.
Our experimental results reveal that most state-of-the-art (SOTA) LLMs perform significantly worse than the human average.
arXiv Detail & Related papers (2025-01-24T15:49:10Z) - VERUS-LM: a Versatile Framework for Combining LLMs with Symbolic Reasoning [8.867818326729367]
We introduce VERUS-LM, a novel framework for neurosymbolic reasoning.
VERUS-LM employs a generic prompting mechanism, clearly separates domain knowledge from queries.
We show that our approach succeeds in diverse reasoning on a novel dataset, markedly outperforming LLMs.
arXiv Detail & Related papers (2025-01-24T14:45:21Z) - Make LLMs better zero-shot reasoners: Structure-orientated autonomous reasoning [52.83539473110143]
We introduce a novel structure-oriented analysis method to help Large Language Models (LLMs) better understand a question.
To further improve the reliability in complex question-answering tasks, we propose a multi-agent reasoning system, Structure-oriented Autonomous Reasoning Agents (SARA)
Extensive experiments verify the effectiveness of the proposed reasoning system. Surprisingly, in some cases, the system even surpasses few-shot methods.
arXiv Detail & Related papers (2024-10-18T05:30:33Z) - Inductive Learning of Logical Theories with LLMs: An Expressivity-Graded Analysis [9.865771016218549]
This work presents a novel systematic methodology to analyse the capabilities and limitations of Large Language Models (LLMs)
The analysis is complexity-graded w.r.t. rule dependency structure, allowing quantification of specific inference challenges on LLM performance.
arXiv Detail & Related papers (2024-08-15T16:41:00Z) - CLR-Fact: Evaluating the Complex Logical Reasoning Capability of Large Language Models over Factual Knowledge [44.59258397967782]
Large language models (LLMs) have demonstrated impressive capabilities across various natural language processing tasks.
We present a systematic evaluation of state-of-the-art LLMs' complex logical reasoning abilities.
We find that LLMs excel at reasoning over general world knowledge but face significant challenges with specialized domain-specific knowledge.
arXiv Detail & Related papers (2024-07-30T05:40:32Z) - Aggregation of Reasoning: A Hierarchical Framework for Enhancing Answer Selection in Large Language Models [84.15513004135576]
Current research enhances the reasoning performance of Large Language Models (LLMs) by sampling multiple reasoning chains and ensembling based on the answer frequency.
This approach fails in scenarios where the correct answers are in the minority.
We introduce a hierarchical reasoning aggregation framework AoR, which selects answers based on the evaluation of reasoning chains.
arXiv Detail & Related papers (2024-05-21T17:12:19Z) - Puzzle Solving using Reasoning of Large Language Models: A Survey [1.9939549451457024]
This survey examines the capabilities of Large Language Models (LLMs) in puzzle solving.
Our findings highlight the disparity between LLM capabilities and human-like reasoning.
The survey underscores the necessity for novel strategies and richer datasets to advance LLMs' puzzle-solving proficiency.
arXiv Detail & Related papers (2024-02-17T14:19:38Z) - When Do Program-of-Thoughts Work for Reasoning? [51.2699797837818]
We propose complexity-impacted reasoning score (CIRS) to measure correlation between code and reasoning abilities.
Specifically, we use the abstract syntax tree to encode the structural information and calculate logical complexity.
Code will be integrated into the EasyInstruct framework at https://github.com/zjunlp/EasyInstruct.
arXiv Detail & Related papers (2023-08-29T17:22:39Z) - Exploring Self-supervised Logic-enhanced Training for Large Language Models [59.227222647741094]
In this paper, we make the first attempt to investigate the feasibility of incorporating logical knowledge through self-supervised post-training.
We devise an auto-regressive objective variant of MERIt and integrate it with two LLM series, i.e., FLAN-T5 and LLaMA, with parameter size ranging from 3 billion to 13 billion.
The results on two challenging logical reasoning benchmarks demonstrate the effectiveness of LogicLLM.
arXiv Detail & Related papers (2023-05-23T06:13:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.