Related papers: Navigating the Labyrinth: Evaluating LLMs' Ability to Reason About Search Problems

Navigating the Labyrinth: Evaluating LLMs' Ability to Reason About Search Problems

URL: http://arxiv.org/abs/2406.12172v3
Date: Mon, 15 Sep 2025 06:17:31 GMT
Title: Navigating the Labyrinth: Evaluating LLMs' Ability to Reason About Search Problems
Authors: Nasim Borazjanizadeh, Roei Herzig, Trevor Darrell, Rogerio Feris, Leonid Karlinsky,
Abstract summary: Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks.<n>We introduce a new benchmark, SearchBench, which contains 11 unique search problems inspired by intuitive puzzles.<n>We show that using step-by-step, language-only reasoning, even the most advanced LLMs fail to solve SearchBench.
Score: 62.76627483915117
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks. However, they often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we introduce a new benchmark, SearchBench, which contains 11 unique search problems inspired by intuitive puzzles. Each SearchBench problem type is equipped with automated pipelines to generate an arbitrary number of instances and analyze the feasibility, correctness, and optimality of LLM-generated solutions. We show that using step-by-step, language-only reasoning, even the most advanced LLMs fail to solve SearchBench; for example, OpenAI's frontier models GPT-4 and o1-preview solve only 1.4% and 18.6% of problems, respectively. The reason is that SearchBench problems require considering multiple pathways and performing backtracking, posing a significant challenge to auto-regressive models. Interestingly, performance is significantly boosted when we prompt models to generate a complete A* search algorithm - a comparatively more cognitively difficult task. This approach effectively offloads the iterative search and backtracking process from the models, which they struggle with in text. This in-context learning baseline is further enhanced via a Multi-Stage-Multi-Try (MSMT) inference method, increasing GPT-4's rate of correct solutions to over 57%.

Related papers

Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding [59.60915947702282]
Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in enhancing the reasoning capabilities of large language models (LLMs)<n>Existing RLVR methods often suffer from exploration inefficiency due to mismatches between the training data's difficulty and the model's capability.<n>We propose SEELE, a novel supervision-aided RLVR framework that dynamically adjusts problem difficulty to stay within the high-efficiency region.
arXiv Detail & Related papers (2025-09-08T17:36:21Z)
AlgoSimBench: Identifying Algorithmically Similar Problems for Competitive Programming [2.3020018305241337]
We introduce AlgoSimBench, a new benchmark designed to assess ability to identify algorithmically similar problems (ASPs)<n>AlgoSimBench consists of 1317 problems, annotated with distinct fine-grained algorithm tags, from which we distract 402 multiple-choice questions (MCQs)<n>Our evaluation reveals that LLMs struggle to identify ASPs, with the best-performing model (o3-mini) achieving only 65.9% accuracy on the MCQ task.<n>We propose attempted solution matching (ASM), a novel method for improving problem similarity detection.
arXiv Detail & Related papers (2025-07-21T08:34:20Z)
Frontier LLMs Still Struggle with Simple Reasoning Tasks [53.497499123166804]
This work studies the performance of frontier language models on a broad set of "easy" reasoning problems.<n>We create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning.<n>We show that even state-of-the-art thinking models consistently fail on such problems and for similar reasons.
arXiv Detail & Related papers (2025-07-09T22:22:49Z)
LLM-First Search: Self-Guided Exploration of the Solution Space [29.780554400938335]
Large Language Models (LLMs) have demonstrated remarkable improvements in reasoning and planning through increased test-time compute.<n>We propose textbfLLM-First Search (LFS), a novel textitLLM Self-Guided Search method.
arXiv Detail & Related papers (2025-06-05T16:27:49Z)
OR-LLM-Agent: Automating Modeling and Solving of Operations Research Optimization Problems with Reasoning LLM [15.260794368585692]
We propose OR-LLM-Agent, an AI agent framework built on reasoning LLMs for automated Operations Research problem solving.<n>We show that OR-LLM-Agent utilizing DeepSeek-R1 in its framework outperforms advanced methods, including GPT-o3, Gemini 2.5 Pro, DeepSeek-R1, and ORLM, by at least 7% in accuracy.
arXiv Detail & Related papers (2025-03-13T03:40:50Z)
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning [50.419872452397684]
Search-R1 is an extension of reinforcement learning for reasoning frameworks. It generates search queries during step-by-step reasoning with real-time retrieval. It improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines.
arXiv Detail & Related papers (2025-03-12T16:26:39Z)
Adaptive Distraction: Probing LLM Contextual Robustness with Automated Tree Search [76.54475437069395]
Large Language Models (LLMs) often struggle to maintain their original performance when faced with semantically coherent but task-irrelevant contextual information.<n>We propose a dynamic distraction generation framework based on tree search, where the generation process is guided by model behavior.
arXiv Detail & Related papers (2025-02-03T18:43:36Z)
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning [83.03531832811386]
BoostStep is a method that enhances reasoning accuracy through step-aligned ICL examples. It integrates seamlessly with chain-of-thought (CoT) and tree search algorithms. It improves DeepSeek-R1-671B's performance on AIME by 2.2%, leveraging simple examples only from the MATH dataset.
arXiv Detail & Related papers (2025-01-06T18:59:13Z)
BEATS: Optimizing LLM Mathematical Capabilities with BackVerify and Adaptive Disambiguate based Efficient Tree Search [22.672130194493793]
Large Language Models (LLMs) have exhibited exceptional performance across a broad range of tasks and domains. They still encounter difficulties in solving mathematical problems due to the rigorous and logical nature of mathematics. We propose a novel approach, BEATS, to enhance mathematical problem-solving abilities.
arXiv Detail & Related papers (2024-09-26T15:47:42Z)
Program Slicing in the Era of Large Language Models [7.990456190723922]
Program slicing is a critical technique in software engineering, enabling developers to isolate relevant portions of code. This study investigates the application of large language models (LLMs) to both static and dynamic program slicing.
arXiv Detail & Related papers (2024-09-19T00:07:56Z)
See What LLMs Cannot Answer: A Self-Challenge Framework for Uncovering LLM Weaknesses [51.975495361024606]
We propose a Self-Challenge evaluation framework with human-in-the-loop. Starting from seed instances that GPT-4 fails to answer, we prompt GPT-4 to summarize error patterns that can be used to generate new instances. We then build a benchmark, SC-G4, consisting of 1,835 instances generated by GPT-4 using these patterns, with human-annotated gold responses.
arXiv Detail & Related papers (2024-08-16T19:01:52Z)
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
We introduce BigCodeBench, a benchmark that challenges Large Language Models (LLMs) to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. We propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions only with essential information.
arXiv Detail & Related papers (2024-06-22T15:52:04Z)
Evaluating LLMs with Multiple Problems at once [9.173325772800341]
This paper shows the benefits and fruitfulness of evaluating LLMs with multiple problems at once.<n>We introduce a new benchmark called ZeMPE (Zero-shot Multi-Problem Evaluation), comprising 53,100 zero-shot multi-problem prompts.<n>Our results show that LLMs are capable of handling multiple problems from a single data source as well as handling them separately, but there are conditions this multiple problem handling capability falls short.
arXiv Detail & Related papers (2024-06-16T02:52:32Z)
MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models. It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
Thought of Search: Planning with Language Models Through The Lens of Efficiency [22.47015814897628]
We argue that recent trends abandon both soundness and completeness for the sake of inefficiency. We show that by using LLMs to produce the code for the search components we can solve the entire datasets with 100% accuracy.
arXiv Detail & Related papers (2024-04-18T01:27:29Z)
Can Language Models Solve Olympiad Programming? [40.54366634332231]
This paper introduces the USACO benchmark with 307 problems from the USA Computing Olympiad. We construct and test a range of LM inference methods for competitive programming for the first time. We find GPT-4 only achieves a 8.7% pass@1 accuracy with zero-shot chain-of-thought prompting.
arXiv Detail & Related papers (2024-04-16T23:27:38Z)
FCoReBench: Can Large Language Models Solve Challenging First-Order Combinatorial Reasoning Problems? [25.352721856952655]
First-order reasoning problems can be instantiated with infinite number of problem instances of varying sizes. We present FCoReBench, a dataset of 40 such challenging problems, along with scripts to generate problem instances of varying sizes and automatically verify and generate their solutions. We propose SymPro-LM, which combines LLMs with both symbolic solvers and program interpreters, along with feedback from a few solved examples, to achieve huge performance gains.
arXiv Detail & Related papers (2024-02-04T20:56:09Z)
Improving Large Language Model Fine-tuning for Solving Math Problems [20.417053742869403]
A large gap exists between large language models' pass-at-one and pass-at-N performance in solving math problems. Using the challenging MATH dataset, we investigate three fine-tuning strategies. We design a fine-tuning recipe that yields approximately 58.8% accuracy on the MATH dataset with fine-tuned PaLM 2-L models.
arXiv Detail & Related papers (2023-10-16T04:11:19Z)
Autonomous Tree-search Ability of Large Language Models [58.68735916408101]
Large Language Models have excelled in remarkable reasoning capabilities with advanced prompting techniques. Recent works propose to utilize external programs to define search logic, such that LLMs can perform passive tree search to solve more challenging reasoning tasks. We propose a new concept called autonomous tree-search ability of LLM, which can automatically generate a response containing search trajectories for the correct answer.
arXiv Detail & Related papers (2023-10-14T14:14:38Z)
ALGO: Synthesizing Algorithmic Programs with LLM-Generated Oracle Verifiers [60.6418431624873]
Large language models (LLMs) excel at implementing code from functionality descriptions but struggle with algorithmic problems. We propose ALGO, a framework that synthesizes Algorithmic programs with LLM-Generated Oracles to guide the generation and verify their correctness. Experiments show that when equipped with ALGO, we achieve an 8x better one-submission pass rate over the Codex model and a 2.6x better one-submission pass rate over CodeT.
arXiv Detail & Related papers (2023-05-24T00:10:15Z)
PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PaL) to understand natural language problems. PaL offloads the solution step to a programmatic runtime such as a Python interpreter. We set new state-of-the-art results in all 12 benchmarks.
arXiv Detail & Related papers (2022-11-18T18:56:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.