Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation
- URL: http://arxiv.org/abs/2507.02253v1
- Date: Thu, 03 Jul 2025 03:02:49 GMT
- Title: Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation
- Authors: Jungkoo Kang,
- Abstract summary: NL2FLOW is a fully automated system for parametrically generating planning problems.<n>Highest problem models achieved 86% success in generating valid plans.<n>Highest success rate for translating natural language into a representation of a plan was lower than the highest rate of generating a valid plan directly.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Progress in enhancing large language model (LLM) planning and reasoning capabilities is significantly hampered by the bottleneck of scalable, reliable data generation and evaluation. To overcome this, I introduce NL2FLOW, a fully automated system for parametrically generating planning problems - expressed in natural language, a structured intermediate representation, and formal PDDL - and rigorously evaluating the quality of generated plans. I demonstrate NL2FLOW's capabilities by generating a dataset of 2296 problems in the automated workflow generation domain and evaluating multiple open-sourced, instruct-tuned LLMs. My results reveal that the highest performing models achieved 86% success in generating valid plans and 69% in generating optimal plans, specifically for problems with feasible solutions. Regression analysis shows that the influence of problem characteristics on plan generation is contingent on both model and prompt design. Notably, I observed that the highest success rate for translating natural language into a JSON representation of a plan was lower than the highest rate of generating a valid plan directly. This suggests that unnecessarily decomposing the reasoning task - introducing intermediate translation steps - may actually degrade performance, implying a benefit to models capable of reasoning directly from natural language to action. As I scale LLM reasoning to increasingly complex problems, the bottlenecks and sources of error within these systems will inevitably shift. Therefore, a dynamic understanding of these limitations - and the tools to systematically reveal them - will be crucial for unlocking the full potential of LLMs as intelligent problem solvers.
Related papers
- Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data [10.698357983420928]
This work aims to improve the robustness of Large Language Models against potential adversarial inputs.<n>We systematically evaluated robustness by fine-tuning models using datasets perturbed at character-level, word-level, and sentence-level.<n>Fine-tuning models with perturbed datasets significantly improves model robustness (RD usually drops around 4% - 6%), especially for models with relatively weak robustness.
arXiv Detail & Related papers (2026-02-11T22:30:01Z) - Model-First Reasoning LLM Agents: Reducing Hallucinations through Explicit Problem Modeling [3.537921035534423]
Large Language Models (LLMs) often struggle with complex multi-step planning tasks.<n>Existing strategies such as Chain-of-Thought and ReAct rely on implicit state tracking and lack an explicit problem representation.<n>We propose Model-First Reasoning (MFR), a two-phase paradigm in which the LLM first constructs an explicit model of the problem.
arXiv Detail & Related papers (2025-12-16T15:07:36Z) - CausalPlan: Empowering Efficient LLM Multi-Agent Collaboration Through Causality-Driven Planning [25.322580535468013]
CausalPlan is a framework that integrates explicit structural causal reasoning into the large language model (LLM) planning process.<n>We evaluate CausalPlan on the Overcooked-AI benchmark across five multi-agent coordination tasks and four LLMs of varying sizes.<n>Results show that CausalPlan consistently reduces invalid actions and improves collaboration in both AI-AI and human-AI settings.
arXiv Detail & Related papers (2025-08-19T10:37:20Z) - PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving [66.42260489147617]
We introduce PLAN-TUNING, a framework that distills synthetic task decompositions from large-scale language models.<n>Plan-TUNING fine-tunes smaller models via supervised and reinforcement-learning objectives to improve complex reasoning.<n>Our analysis demonstrates how planning trajectories improves complex reasoning capabilities.
arXiv Detail & Related papers (2025-07-10T07:30:44Z) - VerifyLLM: LLM-Based Pre-Execution Task Plan Verification for Robots [44.99833362998488]
We propose an architecture for automatically verifying high-level task plans before their execution in simulator or real-world environments.<n>The module uses the reasoning capabilities of the Large Language Models to evaluate logical coherence and identify potential gaps in the plan.<n>We contribute to improving the reliability and efficiency of task planning and addresses the critical need for robust pre-execution verification in autonomous systems.
arXiv Detail & Related papers (2025-07-07T15:31:36Z) - Addressing the Challenges of Planning Language Generation [6.209697341255856]
We design and evaluate 8 different PDDL generation pipelines with open-source models under 50 billion parameters.<n>We find that intuitive approaches such as using a high-resource language wrapper or constrained decoding with grammar decrease performance, yet inference-time scaling approaches such as revision with feedback from the solver and plan validator more than double the performance.
arXiv Detail & Related papers (2025-05-20T17:25:23Z) - Improving Large Language Model Planning with Action Sequence Similarity [50.52049888490524]
In this work, we explore how to improve the model planning capability through in-context learning (ICL)<n>We propose GRASE-DC: a two-stage pipeline that first re-samples high AS exemplars and then curates the selected exemplars.<n>Our experimental result confirms that GRASE-DC achieves significant performance improvement on various planning tasks.
arXiv Detail & Related papers (2025-05-02T05:16:17Z) - Self-Corrective Task Planning by Inverse Prompting with Large Language Models [9.283971287618261]
We introduce InversePrompt, a novel self-corrective task planning approach.<n>Our method incorporates reasoning steps to provide clear, interpretable feedback.<n>Results on benchmark datasets show an average 16.3% higher success rate over existing LLM-based task planning methods.
arXiv Detail & Related papers (2025-03-10T13:35:51Z) - Complex LLM Planning via Automated Heuristics Discovery [48.07520536415374]
We consider enhancing large language models (LLMs) for complex planning tasks.<n>We propose automated inferences discovery (AutoHD), a novel approach that enables LLMs to explicitly generate functions to guide-time search.<n>Our proposed method requires no additional model training or finetuning--and the explicit definition of functions generated by the LLMs provides interpretability and insights into the reasoning process.
arXiv Detail & Related papers (2025-02-26T16:52:31Z) - Zero-shot Robotic Manipulation with Language-guided Instruction and Formal Task Planning [16.89900521727246]
We propose an innovative language-guided symbolic task planning (LM-SymOpt) framework with optimization.<n>It is the first expert-free planning framework since we combine the world knowledge from Large Language Models with formal reasoning.<n>Our experimental results show that LM-SymOpt outperforms existing LLM-based planning approaches.
arXiv Detail & Related papers (2025-01-25T13:33:22Z) - Interactive and Expressive Code-Augmented Planning with Large Language Models [62.799579304821826]
Large Language Models (LLMs) demonstrate strong abilities in common-sense reasoning and interactive decision-making.
Recent techniques have sought to structure LLM outputs using control flow and other code-adjacent techniques to improve planning performance.
We propose REPL-Plan, an LLM planning approach that is fully code-expressive and dynamic.
arXiv Detail & Related papers (2024-11-21T04:23:17Z) - Non-myopic Generation of Language Models for Reasoning and Planning [45.75146679449453]
This paper proposes a novel method, Predictive-Decoding, that leverages Model Predictive Control to enhance planning accuracy.
Our experiments show significant improvements in a wide range of tasks for math, coding, and agents.
arXiv Detail & Related papers (2024-10-22T17:13:38Z) - Benchmarking Agentic Workflow Generation [80.74757493266057]
We introduce WorfBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures.<n>We also present WorfEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms.<n>We observe that the generated can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.
arXiv Detail & Related papers (2024-10-10T12:41:19Z) - Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form Planning [94.76546523689113]
We introduce CodePlan, a framework that generates and follows textcode-form plans -- pseudocode that outlines high-level, structured reasoning processes.
CodePlan effectively captures the rich semantics and control flows inherent to sophisticated reasoning tasks.
It achieves a 25.1% relative improvement compared with directly generating responses.
arXiv Detail & Related papers (2024-09-19T04:13:58Z) - Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT)
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z) - Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z) - Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
This work lays the foundations for improving planning capabilities of large language models (LLMs)
We construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios.
We investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance.
arXiv Detail & Related papers (2024-06-18T22:57:06Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents [39.53593677934238]
Large Language Models (LLMs) enable AI Agents to automatically generate and execute multi-step plans to solve complex tasks.
However, current LLM-based agents frequently generate invalid or non-executable plans.
This paper proposes a novel "Formal-LLM" framework for LLM-based agents by integrating the expressiveness of natural language and the precision of formal language.
arXiv Detail & Related papers (2024-02-01T17:30:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.