Detecting and Characterizing Planning in Language Models
- URL: http://arxiv.org/abs/2508.18098v1
- Date: Mon, 25 Aug 2025 14:59:46 GMT
- Title: Detecting and Characterizing Planning in Language Models
- Authors: Jatin Nainani, Sankaran Vaidyanathan, Connor Watts, Andre N. Assis, Alice Rigg,
- Abstract summary: We present formal and causally grounded criteria for detecting planning and operationalize them as a semi-automated annotation pipeline. We apply this pipeline to both base and instruction-tuned Gemma-2-2B models on the MBPP code generation benchmark and a poem generation task. Our findings show that planning is not universal: unlike Haiku, Gemma-2-2B solves the same poem generation task through improvisation, and on MBPP it switches between planning and improvisation across similar tasks and even successive token predictions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern large language models (LLMs) have demonstrated impressive performance across a wide range of multi-step reasoning tasks. Recent work suggests that LLMs may perform planning - selecting a future target token in advance and generating intermediate tokens that lead towards it - rather than merely improvising one token at a time. However, existing studies assume fixed planning horizons and often focus on single prompts or narrow domains. To distinguish planning from improvisation across models and tasks, we present formal and causally grounded criteria for detecting planning and operationalize them as a semi-automated annotation pipeline. We apply this pipeline to both base and instruction-tuned Gemma-2-2B models on the MBPP code generation benchmark and a poem generation task where Claude 3.5 Haiku was previously shown to plan. Our findings show that planning is not universal: unlike Haiku, Gemma-2-2B solves the same poem generation task through improvisation, and on MBPP it switches between planning and improvisation across similar tasks and even successive token predictions. We further show that instruction tuning refines existing planning behaviors in the base model rather than creating them from scratch. Together, these studies provide a reproducible and scalable foundation for mechanistic studies of planning in LLMs.
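The abstract's causal criterion can be illustrated with a toy sketch. This is not the paper's annotation pipeline: `generate` and `improvise` are hypothetical stand-ins for a model's generation process, and the test simply operationalizes the idea that a generator "plans" if intervening on the hypothesized future target causally changes the intermediate tokens, whereas an improviser's intermediate tokens are unaffected.

```python
# Toy illustration of a causal planning criterion (hypothetical; not the
# paper's pipeline): a generator "plans" if a future target causally shapes
# its intermediate outputs. We test this by intervening on the target and
# checking whether intermediate tokens change.

def generate(target, steps=3):
    """Toy planner: intermediate tokens lead toward `target`."""
    return [f"step{i}->{target}" for i in range(steps)] + [target]

def improvise(prev, steps=3):
    """Toy improviser: each token depends only on the previous one."""
    out = [prev]
    for _ in range(steps):
        out.append(out[-1] + "x")
    return out

def shows_planning(gen_fn):
    """Causal test: do intermediate tokens differ under two target interventions?"""
    a = gen_fn("rhymeA")
    b = gen_fn("rhymeB")
    return a[:-1] != b[:-1]  # compare only the intermediate (non-final) tokens

print(shows_planning(generate))                     # True: planner
print(shows_planning(lambda t: improvise("seed")))  # False: improviser
```

In the real pipeline this intervention would be a causal edit to model internals (e.g. to a candidate target representation), not a change of input string; the sketch only captures the counterfactual logic of the criterion.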
Related papers
- Large Language Models Can Take False First Steps at Inference-time Planning [2.6100783621884625]
Large language models (LLMs) have been shown to acquire sequence-level planning abilities during training. Planning behavior exhibited at inference time often appears short-sighted and inconsistent with these capabilities. We propose a Bayesian account for this gap by grounding planning behavior in the evolving generative context.
arXiv Detail & Related papers (2026-02-03T01:54:55Z) - Can LLM-Reasoning Models Replace Classical Planning? A Benchmark Study [0.0]
Large Language Models have sparked interest in their potential for robotic task planning. While these models demonstrate strong generative capabilities, their effectiveness in producing structured and executable plans remains uncertain. This paper presents a systematic evaluation of a broad spectrum of current state-of-the-art language models.
arXiv Detail & Related papers (2025-07-31T14:25:54Z) - LLMs as Planning Modelers: A Survey for Leveraging Large Language Models to Construct Automated Planning Models [24.230622369142193]
Large Language Models (LLMs) excel in various natural language tasks but often struggle with long-horizon planning problems. This limitation has drawn interest in integrating neuro-symbolic approaches within the Automated Planning (AP) and Natural Language Processing (NLP) communities.
arXiv Detail & Related papers (2025-03-22T03:35:44Z) - Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
This work lays the foundations for improving planning capabilities of large language models (LLMs).
We construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios.
We investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance.
arXiv Detail & Related papers (2024-06-18T22:57:06Z) - From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems [59.40480894948944]
Large language model (LLM)-empowered agents are able to solve decision-making problems in the physical world.
Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting.
We prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning.
arXiv Detail & Related papers (2024-05-30T09:42:54Z) - Learning to Plan and Generate Text with Citations [69.56850173097116]
We explore the attribution capabilities of plan-based models which have been recently shown to improve the faithfulness, grounding, and controllability of generated text.
We propose two attribution models that utilize different variants of blueprints, an abstractive model where questions are generated from scratch, and an extractive model where questions are copied from the input.
arXiv Detail & Related papers (2024-04-04T11:27:54Z) - PARADISE: Evaluating Implicit Planning Skills of Language Models with Procedural Warnings and Tips Dataset [0.0]
We present PARADISE, an abductive reasoning task using Q&A format on practical procedural text sourced from wikiHow.
It involves warning and tip inference tasks directly associated with goals, excluding intermediary steps, with the aim of testing the ability of the models to infer implicit knowledge of the plan solely from the given goal.
Our experiments, utilizing fine-tuned language models and zero-shot prompting, reveal the effectiveness of task-specific small models over large language models in most scenarios.
arXiv Detail & Related papers (2024-03-05T18:01:59Z) - What's the Plan? Evaluating and Developing Planning-Aware Techniques for Language Models [7.216683826556268]
Large language models (LLMs) are increasingly used for applications that require planning capabilities.
We introduce SimPlan, a novel hybrid method, and evaluate its performance in a new challenging setup.
arXiv Detail & Related papers (2024-02-18T07:42:49Z) - Tree-Planner: Efficient Close-loop Task Planning with Large Language Models [63.06270302774049]
Tree-Planner reframes task planning with Large Language Models into three distinct phases.
Tree-Planner achieves state-of-the-art performance while maintaining high efficiency.
arXiv Detail & Related papers (2023-10-12T17:59:50Z) - Skill Induction and Planning with Latent Language [94.55783888325165]
We formulate a generative model of action sequences in which goals generate sequences of high-level subtask descriptions.
We describe how to train this model using primarily unannotated demonstrations by parsing demonstrations into sequences of named high-level subtasks.
In trained models, the space of natural language commands indexes a library of skills; agents can use these skills to plan by generating high-level instruction sequences tailored to novel goals.
arXiv Detail & Related papers (2021-10-04T15:36:32Z) - Divide-and-Conquer Monte Carlo Tree Search For Goal-Directed Planning [78.65083326918351]
We consider alternatives to an implicit sequential planning assumption.
We propose Divide-and-Conquer Monte Carlo Tree Search (DC-MCTS) for approximating the optimal plan.
We show that this algorithmic flexibility over planning order leads to improved results in navigation tasks in grid-worlds.
arXiv Detail & Related papers (2020-04-23T18:08:58Z)
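The DC-MCTS entry above describes planning by recursively splitting a goal-directed problem at subgoals. The following is a stripped-down, deterministic sketch of that divide-and-conquer recursion on a hypothetical 1-D corridor: the midpoint subgoal rule and the corridor world are stand-ins for the learned subgoal proposals and Monte Carlo value estimates of the actual algorithm.

```python
# Deterministic sketch of the divide-and-conquer planning recursion behind
# DC-MCTS (the Monte Carlo search over subgoals is omitted): to plan from
# `start` to `goal`, either connect them with a primitive step or propose an
# intermediate subgoal and recurse on both halves. The world here is a 1-D
# corridor of integer states, and subgoals are proposed at the midpoint.

def plan(start, goal):
    """Return a list of states from start to goal (inclusive)."""
    if start == goal:
        return [start]
    if abs(goal - start) == 1:       # adjacent: a primitive action suffices
        return [start, goal]
    mid = (start + goal) // 2        # subgoal proposal (midpoint heuristic)
    left = plan(start, mid)          # solve start -> subgoal
    right = plan(mid, goal)          # solve subgoal -> goal
    return left + right[1:]          # splice, dropping the duplicated subgoal

print(plan(0, 8))   # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Note that a recursion like this need not resolve the plan left-to-right: the two halves are independent subproblems, which is exactly the flexibility in planning order the DC-MCTS abstract credits for improved navigation results.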
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.