PuzzleBench: Can LLMs Solve Challenging First-Order Combinatorial
Reasoning Problems?
- URL: http://arxiv.org/abs/2402.02611v2
- Date: Thu, 22 Feb 2024 14:42:45 GMT
- Title: PuzzleBench: Can LLMs Solve Challenging First-Order Combinatorial
Reasoning Problems?
- Authors: Chinmay Mittal, Krishna Kartik, Mausam, Parag Singla
- Abstract summary: We present PuzzleBench, a dataset of 31 such challenging problems along with a few solved instances for each problem.
These problems are all first order, i.e., they can be instantiated with problem instances of varying sizes, and most of them are NP-hard.
We first observe that LLMs, even when aided by symbolic solvers, perform rather poorly on our dataset.
In response, we propose a new approach, Puzzle-LM, which combines LLMs with both symbolic solvers and program interpreters.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent works show that the largest of the large language models (LLMs) can
solve many simple reasoning tasks expressed in natural language, without
little or no supervision. But can they also solve challenging first-order
combinatorial reasoning problems, such as graph coloring, knapsack and
cryptarithmetic? To answer this question, we present PuzzleBench, a dataset of
31 such challenging problems along with a few solved instances for each
problem. These problems are all first order, i.e., they can be instantiated
with problem instances of varying sizes, and most of them are NP-hard,
requiring several reasoning steps to reach the solution. We first observe that
LLMs, even when aided by symbolic solvers, perform rather poorly on our
dataset. In response, we propose a new approach, Puzzle-LM, which combines LLMs
with both symbolic solvers and program interpreters, along with feedback from
solved examples, to achieve huge performance gains. Our extensive
experimentation and analyses offer new insights into the reasoning abilities
and limitations of present-day LLMs.
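As a rough illustration of the solver-aided setup the abstract describes, below is a minimal sketch of an LLM-writes-solver refinement loop, assuming a hypothetical `llm_generate` call and trusted code execution; the actual Puzzle-LM pipeline is more elaborate than this. The key ingredient is feedback from solved examples: failing instances are serialized back into the prompt for the next round.

```python
# Minimal sketch of a Puzzle-LM-style refinement loop: the LLM writes a
# solver program, a Python interpreter runs it on solved example instances,
# and failures are fed back as textual feedback. `llm_generate` is a
# hypothetical stand-in for any chat-completion API call.

def llm_generate(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    raise NotImplementedError

def run_solver(code: str, instance: dict):
    """Execute LLM-written solver code and call its solve(instance)."""
    namespace: dict = {}
    exec(code, namespace)  # assumes a trusted sandbox
    return namespace["solve"](instance)

def refine(problem_desc: str, examples: list, rounds: int = 3) -> str:
    prompt = f"Write a Python function solve(instance) for:\n{problem_desc}"
    code = llm_generate(prompt)
    for _ in range(rounds):
        feedback = []
        for instance, expected in examples:
            try:
                got = run_solver(code, instance)
                if got != expected:
                    feedback.append(f"{instance}: got {got}, expected {expected}")
            except Exception as err:
                feedback.append(f"{instance}: raised {err!r}")
        if not feedback:  # all solved examples pass
            return code
        code = llm_generate(prompt + "\nFix these failures:\n" + "\n".join(feedback))
    return code
```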
Related papers
- Not All LLM Reasoners Are Created Equal
We study the depth of grade-school math problem-solving capabilities of LLMs.
We evaluate their performance on pairs of existing math word problems chained together, so that the answer to the second problem depends on correctly answering the first.
arXiv Detail & Related papers (2024-10-02T17:01:10Z)
- Graph Reasoning with Large Language Models via Pseudo-code Prompting
This paper investigates whether prompting via pseudo-code instructions can improve the performance of large language models (LLMs) in solving graph problems.
Our experiments demonstrate that using pseudo-code instructions generally improves the performance of all considered LLMs.
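To make the idea concrete, a pseudo-code-style prompt for a simple reachability question might look like the following; this is an illustrative template, not the paper's exact prompt format.

```python
# Illustrative pseudo-code-style prompt for a graph task. The idea is to
# spell the algorithm out as pseudo-code instead of free-form natural
# language; the paper's actual templates may differ.
PROMPT = """You are given an undirected graph as an edge list.
Follow this pseudo-code exactly and report the result:

    visited = {start}
    frontier = [start]
    while frontier is not empty:
        node = frontier.pop()
        for each neighbor of node:
            if neighbor not in visited:
                add neighbor to visited
                append neighbor to frontier
    answer = (target in visited)

Edges: (0,1), (1,2), (3,4)
start = 0, target = 2
Is target reachable from start? Answer True or False.
"""
```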
arXiv Detail & Related papers (2024-09-26T14:52:40Z)
- Step-by-Step Reasoning to Solve Grid Puzzles: Where do LLMs Falter?
First, we develop GridPuzzle, an evaluation dataset comprising 274 grid-based puzzles of varying complexity.
Second, we propose a new error taxonomy derived from manual analysis of reasoning chains from LLMs including GPT-4, Claude-3, Gemini, Mistral, and Llama-2.
Third, we develop an LLM-based framework for large-scale subjective evaluation (i.e., identifying errors) and an objective metric, PuzzleEval, to evaluate the correctness of reasoning chains.
arXiv Detail & Related papers (2024-07-20T07:43:07Z) - Solving Zebra Puzzles Using Constraint-Guided Multi-Agent Systems [25.0042181817455]
We introduce a multi-agent system, ZPS, that integrates Large Language Models with an off-the-shelf theorem prover.
This system tackles the complex puzzle-solving task by breaking down the problem into smaller, manageable parts.
We also introduce an automated grid puzzle grader to assess the correctness of our puzzle solutions, and show that the automated grader is reliable by evaluating it in a user study.
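As a toy illustration of the constraint-solving half of such a pipeline, a zebra-style puzzle can be handed to an off-the-shelf solver such as Z3; the paper's own agent decomposition and prover integration are more elaborate than this.

```python
# A toy zebra-style puzzle encoded for the Z3 SMT solver (pip install
# z3-solver). Each variable holds the house position (1..3) of one
# attribute value; clues become equality constraints.
from z3 import Int, Solver, Distinct, sat

red, green, blue = Int("red"), Int("green"), Int("blue")
brit, swede, dane = Int("brit"), Int("swede"), Int("dane")

s = Solver()
for v in (red, green, blue, brit, swede, dane):
    s.add(1 <= v, v <= 3)
s.add(Distinct(red, green, blue))
s.add(Distinct(brit, swede, dane))
s.add(brit == red)    # clue: the Brit lives in the red house
s.add(green == 2)     # clue: the green house is in the middle
s.add(swede == blue)  # clue: the Swede lives in the blue house

if s.check() == sat:
    m = s.model()
    print({str(v): m[v] for v in (red, green, blue, brit, swede, dane)})
```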
arXiv Detail & Related papers (2024-07-04T14:22:25Z)
- Navigating the Labyrinth: Evaluating and Enhancing LLMs' Ability to Reason About Search Problems
We introduce a new benchmark, SearchBench, containing 11 unique search problem types.
We show that even the most advanced LLMs fail to solve these problems end-to-end in text.
Instructing LLMs to generate code that solves the problem helps, but only slightly; e.g., GPT-4's performance rises to 11.7%.
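The code-generation route amounts to asking the model to emit a generic search routine rather than reason step by step in text. Below is a sketch of the kind of program an LLM might produce, shown on the classic two-jug water puzzle (an illustrative task, not necessarily one of SearchBench's 11 problem types).

```python
# Generic breadth-first search over a state space, the sort of solver an
# LLM might be prompted to generate for a search problem.
from collections import deque

def bfs(start, goal_test, successors):
    """Return a shortest action sequence from start to a goal state."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if goal_test(state):
            return path
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [action]))
    return None  # goal unreachable

def jug_successors(state, caps=(3, 5)):
    """Moves for two jugs of capacities 3 and 5 litres."""
    a, b = state
    yield "fill A", (caps[0], b)
    yield "fill B", (a, caps[1])
    yield "empty A", (0, b)
    yield "empty B", (a, 0)
    pour = min(a, caps[1] - b)
    yield "A->B", (a - pour, b + pour)
    pour = min(b, caps[0] - a)
    yield "B->A", (a + pour, b - pour)

# Measure exactly 4 litres starting from two empty jugs.
print(bfs((0, 0), lambda s: 4 in s, jug_successors))
```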
arXiv Detail & Related papers (2024-06-18T00:44:58Z)
- Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners?
We study the biases of large language models (LLMs) in relation to those known in children when solving arithmetic word problems.
We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features.
arXiv Detail & Related papers (2024-01-31T18:48:20Z)
- Thought Propagation: An Analogical Approach to Complex Reasoning with Large Language Models
We propose Thought Propagation (TP) to enhance the complex reasoning ability of Large Language Models.
TP first prompts LLMs to propose and solve a set of analogous problems that are related to the input one.
TP reuses the results of analogous problems to directly yield a new solution or derive a knowledge-intensive plan for execution to amend the initial solution obtained from scratch.
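A compressed sketch of this propose-solve-reuse loop is shown below, with `llm` standing in for any text-completion call; this is a hypothetical reconstruction of the pattern, not the paper's implementation.

```python
# Sketch of the Thought Propagation pattern: solve analogous problems
# first, then reuse their results to amend a from-scratch draft.

def llm(prompt: str) -> str:
    """Hypothetical model call; replace with a real API client."""
    raise NotImplementedError

def thought_propagation(problem: str) -> str:
    # 1. Propose analogous problems related to the input one.
    analogs = llm(f"List 3 problems analogous to:\n{problem}")
    # 2. Solve the analogous problems.
    analog_solutions = llm(f"Solve each of these problems:\n{analogs}")
    # 3. Solve from scratch, then amend the draft using the analogies.
    draft = llm(f"Solve:\n{problem}")
    return llm(
        f"Problem:\n{problem}\nDraft solution:\n{draft}\n"
        f"Insights from analogous problems:\n{analog_solutions}\n"
        "Use the insights to correct or improve the draft."
    )
```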
arXiv Detail & Related papers (2023-10-06T01:40:09Z)
- Fill in the Blank: Exploring and Enhancing LLM Capabilities for Backward Reasoning in Math Word Problems
Backward reasoning is relatively unexplored.
It can be seen as the "inverse" of forward reasoning.
We propose variations of three different forward reasoning strategies to improve performance.
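A toy example makes the forward/backward distinction concrete: forward reasoning computes the answer from the givens, while backward reasoning recovers a masked given from the stated answer. The word problem below is illustrative, not taken from the paper.

```python
# Forward: "Sam had 3 apples and bought 2 more. How many now?"  -> 3 + 2 = 5
# Backward: "Sam had 3 apples and bought x more; now he has 5. Find x."
from sympy import symbols, Eq, solve

x = symbols("x")
print(3 + 2)                   # forward reasoning: 5
print(solve(Eq(3 + x, 5), x))  # backward reasoning: [2]
```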
arXiv Detail & Related papers (2023-10-03T12:03:06Z)
- Faith and Fate: Limits of Transformers on Compositionality
We investigate the limits of transformer large language models across three representative compositional tasks.
These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer.
Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching.
arXiv Detail & Related papers (2023-05-29T23:24:14Z)
- PAL: Program-aided Language Models
We present Program-Aided Language models (PAL), which read natural language problems and generate programs as intermediate reasoning steps.
PAL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
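A minimal PAL-style sketch: the model emits its reasoning as Python and the interpreter computes the answer. The generated program below is hard-coded for illustration, standing in for actual model output.

```python
# PAL pattern: the LLM writes the reasoning as executable Python; the
# solution step is offloaded to the interpreter rather than done in text.
GENERATED = """
# Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many now?
tennis_balls = 5
bought_balls = 2 * 3
answer = tennis_balls + bought_balls
"""

def run_program(code: str):
    namespace: dict = {}
    exec(code, namespace)  # interpreter performs the computation
    return namespace["answer"]

print(run_program(GENERATED))  # -> 11
```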
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.