Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents
- URL: http://arxiv.org/abs/2506.04649v1
- Date: Thu, 05 Jun 2025 05:31:50 GMT
- Title: Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents
- Authors: Juhyun Oh, Eunsu Kim, Alice Oh
- Abstract summary: We introduce Flex-TravelPlanner, a benchmark that evaluates language models' ability to reason flexibly in dynamic planning scenarios. Our analysis of GPT-4o and Llama 3.1 70B reveals several key findings.
- Score: 16.295418365993033
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world planning problems require constant adaptation to changing requirements and balancing of competing constraints. However, current benchmarks for evaluating LLMs' planning capabilities primarily focus on static, single-turn scenarios. We introduce Flex-TravelPlanner, a benchmark that evaluates language models' ability to reason flexibly in dynamic planning scenarios. Building on the TravelPlanner dataset (Xie et al., 2024), we introduce two novel evaluation settings: (1) sequential constraint introduction across multiple turns, and (2) scenarios with explicitly prioritized competing constraints. Our analysis of GPT-4o and Llama 3.1 70B reveals several key findings: models' performance on single-turn tasks poorly predicts their ability to adapt plans across multiple turns; constraint introduction order significantly affects performance; and models struggle with constraint prioritization, often incorrectly favoring newly introduced lower-priority preferences over existing higher-priority constraints. These findings highlight the importance of evaluating LLMs in more realistic, dynamic planning scenarios and suggest specific directions for improving model performance on complex planning tasks. The code and dataset for our framework are publicly available at https://github.com/juhyunohh/FlexTravelBench.
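To make the first evaluation setting concrete, the sketch below shows one way a sequential-constraint loop could be scored: a constraint is revealed each turn and the revised plan is checked against every constraint seen so far. This is an illustrative assumption, not the benchmark's actual harness; the names `query_model`, `check`, and the message format are hypothetical.

```python
# Hypothetical sketch of multi-turn constraint introduction (not the
# benchmark's actual code). Constraints are revealed one per turn and the
# revised plan is scored against all constraints introduced so far.
from typing import Callable, Dict, List


def evaluate_sequential(query_model: Callable[[List[Dict]], str],
                        check: Callable[[str, str], bool],
                        task: str,
                        constraints: List[str]) -> List[float]:
    """Return the fraction of seen constraints satisfied after each turn."""
    messages = [{"role": "user", "content": task}]
    per_turn_scores = []
    for turn, constraint in enumerate(constraints, start=1):
        # Reveal the next constraint and ask for a revised plan.
        messages.append({
            "role": "user",
            "content": f"New constraint: {constraint}. Please revise the plan.",
        })
        plan = query_model(messages)
        messages.append({"role": "assistant", "content": plan})
        # Score the revised plan against every constraint seen so far.
        seen = constraints[:turn]
        satisfied = sum(check(plan, c) for c in seen)
        per_turn_scores.append(satisfied / len(seen))
    return per_turn_scores
```

The prioritized-constraint setting could be scored similarly, except that each constraint carries a priority and violations of higher-priority constraints are weighted more heavily than violations of newly introduced lower-priority preferences.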
Related papers
- PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving [66.42260489147617]
We introduce PLAN-TUNING, a framework that distills synthetic task decompositions from large-scale language models. PLAN-TUNING fine-tunes smaller models via supervised and reinforcement-learning objectives to improve complex reasoning. Our analysis demonstrates how planning trajectories improve complex reasoning capabilities.
arXiv Detail & Related papers (2025-07-10T07:30:44Z)
- Plan Your Travel and Travel with Your Plan: Wide-Horizon Planning and Evaluation via LLM [58.50687282180444]
Travel planning is a complex task requiring the integration of diverse real-world information and user preferences. We formulate this as an $L^3$ planning problem, emphasizing long context, long instruction, and long output. We introduce Multiple Aspects of Planning (MAoP), enabling LLMs to conduct wide-horizon thinking to solve complex planning problems.
arXiv Detail & Related papers (2025-06-14T09:37:59Z)
- Reinforced Reasoning for Embodied Planning [18.40186665383579]
Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals. We introduce a reinforcement fine-tuning framework that brings R1-style reasoning enhancement into embodied planning.
arXiv Detail & Related papers (2025-05-28T07:21:37Z)
- World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning [60.100794160682646]
We propose a new learning framework that jointly optimizes state prediction and action selection through preference learning. To automatically collect trajectories and stepwise preference data without human annotation, we introduce a tree search mechanism for extensive exploration via trial-and-error. Our method significantly outperforms existing methods and GPT-4o when applied to Qwen2-VL (7B), LLaVA-1.6 (7B), and LLaMA-3.2 (11B).
arXiv Detail & Related papers (2025-03-13T15:49:56Z)
- On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability [59.72892401927283]
We evaluate the planning capabilities of OpenAI's o1 models across a variety of benchmark tasks.
Our results reveal that o1-preview outperforms GPT-4 in adhering to task constraints.
arXiv Detail & Related papers (2024-09-30T03:58:43Z)
- Exploring and Benchmarking the Planning Capabilities of Large Language Models [57.23454975238014]
This work lays the foundations for improving the planning capabilities of large language models (LLMs).
We construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios.
We investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance.
arXiv Detail & Related papers (2024-06-18T22:57:06Z)
- TRIP-PAL: Travel Planning with Guarantees by Combining Large Language Models and Automated Planners [6.378824981027464]
Traditional approaches rely on problem formulation in a given formal language.
Recent Large Language Model (LLM)-based approaches generate plans directly from user requests expressed in natural language.
We propose TRIP-PAL, a hybrid method that combines the strengths of LLMs and automated planners.
arXiv Detail & Related papers (2024-06-14T17:31:16Z)
- Can Vehicle Motion Planning Generalize to Realistic Long-tail Scenarios? [11.917542484123134]
Real-world autonomous driving systems must make safe decisions in the face of rare and diverse traffic scenarios.
Current state-of-the-art planners are mostly evaluated on real-world datasets like nuScenes (open-loop) or nuPlan (closed-loop).
arXiv Detail & Related papers (2024-04-11T08:57:48Z)
- AdaPlanner: Adaptive Planning from Feedback with Language Models [56.367020818139665]
Large language models (LLMs) have recently demonstrated the potential to act as autonomous agents for sequential decision-making tasks.
We propose a closed-loop approach, AdaPlanner, which allows the LLM agent to refine its self-generated plan adaptively in response to environmental feedback.
To mitigate hallucination, we develop a code-style LLM prompt structure that facilitates plan generation across a variety of tasks, environments, and agent capabilities.
arXiv Detail & Related papers (2023-05-26T05:52:27Z)
- Sequence-Based Plan Feasibility Prediction for Efficient Task and Motion Planning [36.300564378022315]
We present a learning-enabled Task and Motion Planning (TAMP) algorithm for solving mobile manipulation problems in environments with many articulated and movable obstacles.
The core of our algorithm is PIGINet, a novel Transformer-based learning method that takes in a task plan, the goal, and the initial state, and predicts the probability of finding motion trajectories associated with the task plan.
arXiv Detail & Related papers (2022-11-03T04:12:04Z)
- Generating Useful Accident-Prone Driving Scenarios via a Learned Traffic Prior [135.78858513845233]
STRIVE is a method to automatically generate challenging scenarios that cause a given planner to produce undesirable behavior, like collisions.
To maintain scenario plausibility, the key idea is to leverage a learned model of traffic motion in the form of a graph-based conditional VAE.
A subsequent optimization is used to find a "solution" to the scenario, ensuring it is useful to improve the given planner.
arXiv Detail & Related papers (2021-12-09T18:03:27Z)