Related papers: Can Large Language Models be Good Path Planners? A Benchmark and Investigation on Spatial-temporal Reasoning

Can Large Language Models be Good Path Planners? A Benchmark and Investigation on Spatial-temporal Reasoning

URL: http://arxiv.org/abs/2310.03249v2
Date: Wed, 7 Feb 2024 20:18:54 GMT
Title: Can Large Language Models be Good Path Planners? A Benchmark and Investigation on Spatial-temporal Reasoning
Authors: Mohamed Aghzal, Erion Plaku, Ziyu Yao
Abstract summary: Large language models (LLMs) have achieved remarkable success across a wide spectrum of tasks. We propose a new benchmark, termed $textbfP$ath $textbfP$lanning from $textbfN$atural $textbfL$anguage.
Score: 10.633920029087676
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have achieved remarkable success across a wide spectrum of tasks; however, they still face limitations in scenarios that demand long-term planning and spatial reasoning. To facilitate this line of research, in this work, we propose a new benchmark, termed $\textbf{P}$ath $\textbf{P}$lanning from $\textbf{N}$atural $\textbf{L}$anguage ($\textbf{PPNL}$). Our benchmark evaluates LLMs' spatial-temporal reasoning by formulating ''path planning'' tasks that require an LLM to navigate to target locations while avoiding obstacles and adhering to constraints. Leveraging this benchmark, we systematically investigate LLMs including GPT-4 via different few-shot prompting methodologies as well as BART and T5 of various sizes via fine-tuning. Our experimental results show the promise of few-shot GPT-4 in spatial reasoning, when it is prompted to reason and act interleavedly, although it still fails to perform long-term temporal reasoning. In contrast, while fine-tuned LLMs achieved impressive results on in-distribution reasoning tasks, they struggled to generalize to larger environments or environments with more obstacles.

Related papers

Limited Reasoning Space: The cage of long-horizon reasoning in LLMs [13.848126962400878]
This work hypothesizes that reasoning failures with larger compute budgets stem from static planning methods.<n>We propose Halo, a model predictive control framework for planning.
arXiv Detail & Related papers (2026-02-22T17:28:27Z)
Language-based Trial and Error Falls Behind in the Era of Experience [50.503828360874536]
Large Language Models (LLMs) excel in language-based agentic tasks, but their applicability to unseen, nonlinguistic environments remains limited.<n>In this work, we demonstrate the primary bottleneck is the prohibitive cost of exploration.<n>We propose SCOUT, a novel framework that decouples exploration from semantic exploitation.
arXiv Detail & Related papers (2026-01-29T14:08:41Z)
Enhancing Long Chain-of-Thought Reasoning through Multi-Path Plan Aggregation [32.86351316550696]
We analyze raw long CoTs and uncover a reasoning hierarchy consisting of planning and execution steps.<n>Motivated by this observation, we propose Multi-Path Plan Aggregation (MPPA), a framework that augments single-pass reasoning with plan exploration and aggregation.<n>To overcome this, we introduce online Step-DPO, a process-level preference optimization scheme that leverages Twisted Sequential Monte Carlo (TSMC) to provide scalable stepwise supervision.
arXiv Detail & Related papers (2025-10-13T17:02:41Z)
PlanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations [75.04864582433879]
PlanQA is a diagnostic benchmark for evaluating geometric and spatial reasoning in large-language models.<n>The benchmark uncovers diverse question types that test not only metric and topological reasoning but also interior design constraints.
arXiv Detail & Related papers (2025-07-10T11:16:48Z)
Time's Up! An Empirical Study of LLM Reasoning Ability Under Output Length Constraint [20.685932824324446]
We investigate whether and how the reasoning abilities of Large Language Models (LLMs) remain effective under real-world latency constraints. Specifically, we test more than 25 LLMs on common reasoning datasets under a wide range of output length budgets. The results have demonstrated several interesting findings regarding the budget-aware LLM reasoning.
arXiv Detail & Related papers (2025-04-19T16:32:28Z)
SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs [48.28847964704554]
Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks. We propose a novel approach for continuous-space reasoning that does not require modifying the underlying LLM.
arXiv Detail & Related papers (2025-02-17T18:52:29Z)
Large Language Models Can Self-Improve in Long-context Reasoning [100.52886241070907]
Large language models (LLMs) have achieved substantial progress in processing long contexts but still struggle with long-context reasoning. We propose ours, an approach specifically designed for this purpose. ours achieves superior performance compared to prior approaches that depend on data produced by human experts or advanced models.
arXiv Detail & Related papers (2024-11-12T19:53:00Z)
FLARE: Faithful Logic-Aided Reasoning and Exploration [50.9814063216852]
We introduce a novel approach for traversing the problem space using task decompositions. We use the Large Language Models to plan a solution, soft-formalise the query into facts and predicates using a logic programming code. Our method allows us to compute the faithfulness of the reasoning process w.r.t. the generated code and analyse the steps of the multi-hop search without relying on external solvers.
arXiv Detail & Related papers (2024-10-14T19:39:11Z)
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
Look Further Ahead: Testing the Limits of GPT-4 in Path Planning [9.461626534488117]
Large Language Models (LLMs) have shown impressive capabilities across a wide variety of tasks. Our proposed benchmark systematically tests path-planning skills in complex settings. We found that framing prompts as Python code and decomposing long trajectory tasks improve GPT-4's path planning effectiveness.
arXiv Detail & Related papers (2024-06-17T18:12:56Z)
From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems [59.40480894948944]
Large language model (LLM) empowered agents are able to solve decision-making problems in the physical world. Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting. We prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning.
arXiv Detail & Related papers (2024-05-30T09:42:54Z)
Can large language models explore in-context? [87.49311128190143]
We deploy Large Language Models as agents in simple multi-armed bandit environments. We find that the models do not robustly engage in exploration without substantial interventions.
arXiv Detail & Related papers (2024-03-22T17:50:43Z)
LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation [38.66406497318709]
This work focuses on the tabletop manipulation task and releases a simulation benchmark, textitLoHoRavens, which covers various long-horizon reasoning aspects spanning color, size, space, arithmetics and reference. We investigate two methods of bridging the modality gap: caption generation and learnable interface for incorporating explicit and implicit observation feedback to the LLM.
arXiv Detail & Related papers (2023-10-18T14:53:14Z)
Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency [53.8779374188643]
We propose a principled framework with provable regret guarantees to orchestrate reasoning and acting. Specifically, we design a prompt template for reasoning that learns from the memory buffer and plans a future trajectory over a long horizon. At each step, the LLM agent takes the initial action of the planned trajectory ("act for now"), stores the collected feedback in the memory buffer, and reinvokes the reasoning routine to replan the future trajectory from the new state.
arXiv Detail & Related papers (2023-09-29T16:36:39Z)
Reasoning with Language Model is Planning with World Model [27.24144881796878]
Large language models (LLMs) have shown remarkable reasoning capabilities. LLMs lack an internal $textitworld model$ to predict the world. We propose a new LLM reasoning framework, $underlineR$easoning vi$underlinea$ $underlineP$lanning $textbf(RAP)$.
arXiv Detail & Related papers (2023-05-24T10:28:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.