SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas
- URL: http://arxiv.org/abs/2505.14615v1
- Date: Tue, 20 May 2025 17:00:22 GMT
- Title: SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas
- Authors: Anjiang Wei, Yuheng Wu, Yingjia Wan, Tarun Suresh, Huanmi Tan, Zhanke Zhou, Sanmi Koyejo, Ke Wang, Alex Aiken
- Abstract summary: SATBench is a benchmark for evaluating the logical reasoning capabilities of large language models (LLMs). Each instance in SATBench is generated from a SAT formula, then translated into a story context and conditions using LLMs. Experiments show that even the strongest model, o4-mini, achieves only 65.0% accuracy on hard UNSAT problems.
- Score: 16.721380234044027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce SATBench, a benchmark for evaluating the logical reasoning capabilities of large language models (LLMs) through logical puzzles derived from Boolean satisfiability (SAT) problems. Unlike prior work that focuses on inference rule-based reasoning, which often involves deducing conclusions from a set of premises, our approach leverages the search-based nature of SAT problems, where the objective is to find a solution that fulfills a specified set of logical constraints. Each instance in SATBench is generated from a SAT formula, then translated into a story context and conditions using LLMs. The generation process is fully automated and allows for adjustable difficulty by varying the number of clauses. All 2100 puzzles are validated through both LLM-assisted and solver-based consistency checks, with human validation on a subset. Experimental results show that even the strongest model, o4-mini, achieves only 65.0% accuracy on hard UNSAT problems, close to the random baseline of 50%. SATBench exposes fundamental limitations in the search-based logical reasoning abilities of current LLMs and provides a scalable testbed for future research in logical reasoning.
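As a rough illustration of the kind of pipeline the abstract describes, the sketch below samples a random CNF formula, labels it SAT or UNSAT, and renders each clause as a natural-language condition. It is a minimal, self-contained approximation rather than the authors' code: SATBench itself uses LLMs for the story translation and both solver-based and LLM-assisted checks for validation, whereas here the translation is a fixed template and the label comes from brute-force enumeration; all names and parameters are illustrative.

```python
import itertools
import random

def random_cnf(num_vars: int, num_clauses: int, clause_len: int = 3, seed: int = 0):
    """Sample a random CNF formula as a list of clauses (lists of signed integers)."""
    rng = random.Random(seed)
    clauses = []
    for _ in range(num_clauses):
        vars_ = rng.sample(range(1, num_vars + 1), clause_len)
        clauses.append([v if rng.random() < 0.5 else -v for v in vars_])
    return clauses

def is_satisfiable(num_vars: int, clauses) -> bool:
    """Brute-force SAT check; a real pipeline would call a SAT solver instead."""
    for bits in itertools.product([False, True], repeat=num_vars):
        assignment = {i + 1: bits[i] for i in range(num_vars)}
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause) for clause in clauses):
            return True
    return False

def clause_to_condition(clause, names) -> str:
    """Template rendering of one clause as a natural-language condition.
    (SATBench uses an LLM for this translation; a template keeps the sketch self-contained.)"""
    parts = [f"{names[abs(lit)]} is {'invited' if lit > 0 else 'not invited'}" for lit in clause]
    return "At least one of the following holds: " + ", or ".join(parts) + "."

if __name__ == "__main__":
    num_vars, num_clauses = 6, 12  # difficulty is adjusted by varying the number of clauses
    names = {i: f"guest {chr(ord('A') + i - 1)}" for i in range(1, num_vars + 1)}
    cnf = random_cnf(num_vars, num_clauses)
    label = "SAT" if is_satisfiable(num_vars, cnf) else "UNSAT"
    print(f"Ground-truth label: {label}")
    for clause in cnf:
        print(clause_to_condition(clause, names))
```

Scaling num_clauses relative to num_vars is the knob the abstract's adjustable difficulty corresponds to: denser random formulas tend to be harder and more often unsatisfiable.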
Related papers
- Logic-of-Thought: Empowering Large Language Models with Logic Programs for Solving Puzzles in Natural Language [67.51318974970985]
Solving puzzles in natural language poses a long-standing challenge in AI. We propose Logic-of-Thought, a framework that bridges large language models with logic programming. We evaluate our method on various grid puzzles and dynamic puzzles involving actions, demonstrating near-perfect accuracy across all tasks.
arXiv Detail & Related papers (2025-05-22T01:37:40Z) - SolSearch: An LLM-Driven Framework for Efficient SAT-Solving Code Generation [13.056487325961688]
The Satisfiability (SAT) problem is a core challenge with significant applications in software engineering. This paper presents SolSearch, a novel framework that harnesses large language models (LLMs) to discover and optimize SAT-solving strategies automatically.
arXiv Detail & Related papers (2025-02-20T07:25:21Z) - On Memorization of Large Language Models in Logical Reasoning [70.94164038947078]
Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes. One hypothesis is that the increasingly high and nearly saturated performance could be due to the memorization of similar problems. We show that fine-tuning leads to heavy memorization, but it also consistently improves generalization performance.
arXiv Detail & Related papers (2024-10-30T15:31:54Z) - General Method for Solving Four Types of SAT Problems [12.28597116379225]
Existing methods provide different algorithms for different types of Boolean satisfiability (SAT) problems.
This study proposes DCSAT, a unified framework based on integer programming and reinforcement learning (RL), to solve different types of SAT problems.
arXiv Detail & Related papers (2023-12-27T06:09:48Z) - Explaining SAT Solving Using Causal Reasoning [30.469229388827443]
We introduce CausalSAT, which employs causal reasoning to gain insights into the functioning of modern SAT solvers.
We use CausalSAT to quantitatively verify hypotheses previously regarded as "rules of thumb" or empirical findings.
arXiv Detail & Related papers (2023-06-09T22:53:16Z) - Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning [101.26814728062065]
Large Language Models (LLMs) have shown human-like reasoning abilities but still struggle with complex logical problems.
This paper introduces a novel framework, Logic-LM, which integrates LLMs with symbolic solvers to improve logical problem-solving.
arXiv Detail & Related papers (2023-05-20T22:25:38Z) - SatLM: Satisfiability-Aided Language Models Using Declarative Prompting [68.40726892904286]
We propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of large language models (LLMs).
We use an LLM to generate a declarative task specification rather than an imperative program and leverage an off-the-shelf automated theorem prover to derive the final answer.
We evaluate SatLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm; a toy sketch of this declarative paradigm appears after this list.
arXiv Detail & Related papers (2023-05-16T17:55:51Z) - Estimating the hardness of SAT encodings for Logical Equivalence Checking of Boolean circuits [58.83758257568434]
We show that the hardness of SAT encodings for LEC instances can be estimated w.r.t. some SAT partitioning.
The paper proposes several methods for constructing partitionings, which, when used in practice, allow one to estimate the hardness of SAT encodings for LEC with good accuracy.
arXiv Detail & Related papers (2022-10-04T09:19:13Z) - SATformer: Transformer-Based UNSAT Core Learning [0.0]
This paper introduces SATformer, a Transformer-based approach for the Boolean Satisfiability (SAT) problem.
Rather than solving the problem directly, SATformer approaches the problem from the opposite direction by focusing on unsatisfiability.
SATformer is trained through a multi-task learning approach, using the single-bit satisfiability result and the minimal unsatisfiable core.
Experimental results show that our SATformer can decrease the runtime of existing solvers by an average of 21.33%.
arXiv Detail & Related papers (2022-09-02T11:17:39Z) - DeepSAT: An EDA-Driven Learning Framework for SAT [9.111341161918375]
We present DeepSAT, a novel end-to-end learning framework for the Boolean satisfiability (SAT) problem.
DeepSAT achieves significant accuracy improvements over state-of-the-art learning-based SAT solutions.
arXiv Detail & Related papers (2022-05-27T03:20:42Z) - Machine Learning Methods in Solving the Boolean Satisfiability Problem [72.21206588430645]
The paper reviews the recent literature on solving the Boolean satisfiability problem (SAT) with machine learning techniques.
We examine the evolving ML-SAT solvers from naive classifiers with handcrafted features to the emerging end-to-end SAT solvers such as NeuroSAT.
arXiv Detail & Related papers (2022-03-02T05:14:12Z) - Transformer-based Machine Learning for Fast SAT Solvers and Logic Synthesis [63.53283025435107]
CNF-based SAT and MaxSAT solvers are central to logic synthesis and verification systems.
In this work, we propose a one-shot model derived from the Transformer architecture to solve the MaxSAT problem.
arXiv Detail & Related papers (2021-07-15T04:47:35Z)
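Referring back to the SatLM entry above: the sketch below is a toy illustration of the declarative, satisfiability-aided paradigm it describes, with the Z3 solver used as an assumed stand-in for an off-the-shelf automated theorem prover. The constraints and variable names are made up for illustration and are not from the paper.

```python
# Declare the problem as constraints and let the solver find the answer,
# instead of writing imperative solution code.
from z3 import Bool, Or, Not, Solver, sat

a, b, c = Bool("alice_attends"), Bool("bob_attends"), Bool("carol_attends")
s = Solver()
s.add(Or(a, b))        # "Alice or Bob attends"
s.add(Or(Not(b), c))   # "If Bob attends, then Carol attends"
s.add(Not(c))          # "Carol does not attend"

if s.check() == sat:
    print("SAT:", s.model())  # e.g. Alice attends; Bob and Carol do not
else:
    print("UNSAT: the conditions are contradictory")
```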