Related papers: On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability

On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability

URL: http://arxiv.org/abs/2409.19924v4
Date: Mon, 14 Oct 2024 03:41:09 GMT
Title: On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability
Authors: Kevin Wang, Junbo Li, Neel P. Bhatt, Yihan Xi, Qiang Liu, Ufuk Topcu, Zhangyang Wang,
Abstract summary: We evaluate the planning capabilities of OpenAI's o1 models across a variety of benchmark tasks. Our results reveal that o1-preview outperforms GPT-4 in adhering to task constraints.
Score: 59.72892401927283
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advancements in Large Language Models (LLMs) have showcased their ability to perform complex reasoning tasks, but their effectiveness in planning remains underexplored. In this study, we evaluate the planning capabilities of OpenAI's o1 models across a variety of benchmark tasks, focusing on three key aspects: feasibility, optimality, and generalizability. Through empirical evaluations on constraint-heavy tasks (e.g., $\textit{Barman}$, $\textit{Tyreworld}$) and spatially complex environments (e.g., $\textit{Termes}$, $\textit{Floortile}$), we highlight o1-preview's strengths in self-evaluation and constraint-following, while also identifying bottlenecks in decision-making and memory management, particularly in tasks requiring robust spatial reasoning. Our results reveal that o1-preview outperforms GPT-4 in adhering to task constraints and managing state transitions in structured environments. However, the model often generates suboptimal solutions with redundant actions and struggles to generalize effectively in spatially complex tasks. This pilot study provides foundational insights into the planning limitations of LLMs, offering key directions for future research on improving memory management, decision-making, and generalization in LLM-based planning. Code available at https://github.com/VITA-Group/o1-planning.

Related papers

OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling [13.57588221678224]
Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling.<n>The boundaries of their capabilities in automated formulation and problem solving remain poorly understood.<n>We propose OPT-ENGINE, a benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels.
arXiv Detail & Related papers (2026-01-09T09:22:33Z)
OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning [50.45036742963495]
We introduce OmniEVA, an embodied versatile planner that enables advanced embodied reasoning and task planning.<n>A Task-Adaptive 3D Grounding mechanism enables context-aware 3D grounding for diverse embodied tasks.<n>An Embodiment-Aware Reasoning framework incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable.
arXiv Detail & Related papers (2025-09-11T10:32:22Z)
HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds [0.0]
Large language models (LLMs) have shown remarkable capabilities in isolated step-by-step reasoning tasks such as mathematics and programming.<n>But their proficiency in long-horizon planning, where solutions require extended, structured sequences of interdependent actions, remains underexplored.<n>We introduce HeroBench, a novel benchmark designed specifically to evaluate long-horizon planning and structured reasoning within complex RPG-inspired virtual worlds.
arXiv Detail & Related papers (2025-08-18T09:59:02Z)
PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving [66.42260489147617]
We introduce PLAN-TUNING, a framework that distills synthetic task decompositions from large-scale language models.<n>Plan-TUNING fine-tunes smaller models via supervised and reinforcement-learning objectives to improve complex reasoning.<n>Our analysis demonstrates how planning trajectories improves complex reasoning capabilities.
arXiv Detail & Related papers (2025-07-10T07:30:44Z)
Reinforced Reasoning for Embodied Planning [18.40186665383579]
Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals.<n>We introduce a reinforcement fine-tuning framework that brings R1-style reasoning enhancement into embodied planning.
arXiv Detail & Related papers (2025-05-28T07:21:37Z)
World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning [60.100794160682646]
We propose a new learning framework that jointly optimize state prediction and action selection through preference learning. To automatically collect trajectories and stepwise preference data without human annotation, we introduce a tree search mechanism for extensive exploration via trial-and-error. Our method significantly outperforms existing methods and GPT-4o when applied to Qwen2-VL (7B), LLaVA-1.6 (7B), and LLaMA-3.2 (11B)
arXiv Detail & Related papers (2025-03-13T15:49:56Z)
Planning Anything with Rigor: General-Purpose Zero-Shot Planning with LLM-based Formalized Programming [13.246017517159043]
Large language models (LLMs) have recently demonstrated strong potential in solving planning problems. We propose LLpreview, a framework that leverages LLMs to capture key information from planning problems and formally formulate and solve them as optimization problems from scratch. We apply LLpreview to 9 planning problems, ranging from multi-constraint decision making to multi-step planning problems, and demonstrate that LLpreview achieves on average 83.7% and 86.8% optimal rate across 9 tasks for GPTo and Claude 3.5 Sonnet.
arXiv Detail & Related papers (2024-10-15T23:20:54Z)
EVOLvE: Evaluating and Optimizing LLMs For Exploration [76.66831821738927]
Large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. We measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs.
arXiv Detail & Related papers (2024-10-08T17:54:03Z)
ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models [39.606908488885125]
ET-Plan-Bench is a benchmark for embodied task planning using Large Language Models (LLMs) It features a controllable and diverse set of embodied tasks varying in different levels of difficulties and complexities. Our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework.
arXiv Detail & Related papers (2024-10-02T19:56:38Z)
Spatial Reasoning and Planning for Deep Embodied Agents [2.7195102129095003]
This thesis explores the development of data-driven techniques for spatial reasoning and planning tasks. It focuses on enhancing learning efficiency, interpretability, and transferability across novel scenarios.
arXiv Detail & Related papers (2024-09-28T23:05:56Z)
Unlocking Reasoning Potential in Large Langauge Models by Scaling Code-form Planning [94.76546523689113]
We introduce CodePlan, a framework that generates and follows textcode-form plans -- pseudocode that outlines high-level, structured reasoning processes. CodePlan effectively captures the rich semantics and control flows inherent to sophisticated reasoning tasks. It achieves a 25.1% relative improvement compared with directly generating responses.
arXiv Detail & Related papers (2024-09-19T04:13:58Z)
VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs) One understudied capability inVLMs is visual spatial planning. Our study introduces a benchmark that evaluates the spatial planning capability in these models in general.
arXiv Detail & Related papers (2024-07-02T00:24:01Z)
Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
This work lays the foundations for improving planning capabilities of large language models (LLMs) We construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios. We investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance.
arXiv Detail & Related papers (2024-06-18T22:57:06Z)
Unlocking Large Language Model's Planning Capabilities with Maximum Diversity Fine-tuning [10.704716790096498]
Large language models (LLMs) have demonstrated impressive task-solving capabilities, achieved through either prompting techniques or system designs. This paper investigates the impact of fine-tuning on LLMs' planning capabilities. We propose the Maximum Diversity Fine-Tuning (MDFT) strategy to improve the sample efficiency of fine-tuning in the planning domain.
arXiv Detail & Related papers (2024-06-15T03:06:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.