COMPASS: A Multi-Turn Benchmark for Tool-Mediated Planning & Preference Optimization
- URL: http://arxiv.org/abs/2510.07043v1
- Date: Wed, 08 Oct 2025 14:09:46 GMT
- Title: COMPASS: A Multi-Turn Benchmark for Tool-Mediated Planning & Preference Optimization
- Authors: Tian Qin, Felix Bai, Ting-Yao Hu, Raviteja Vemulapalli, Hema Swetha Koppula, Zhiyang Xu, Bowen Jin, Mert Cemri, Jiarui Lu, Zirui Wang, Meng Cao,
- Abstract summary: We introduce a benchmark that evaluates agents on realistic travel-planning scenarios.<n>We build a realistic travel database covering transportation, accommodation, and ticketing for 20 U.S. National Parks.<n>We uncover two critical gaps: (i) an acceptable-optimal gap, where agents reliably meet constraints but fail to optimize preferences, and (ii) a plan-coordination gap, where performance collapses on multi-service (flight and hotel) coordination tasks.
- Score: 47.26757420020116
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world large language model (LLM) agents must master strategic tool use and user preference optimization through multi-turn interactions to assist users with complex planning tasks. We introduce COMPASS (Constrained Optimization through Multi-turn Planning and Strategic Solutions), a benchmark that evaluates agents on realistic travel-planning scenarios. We cast travel planning as a constrained preference optimization problem, where agents must satisfy hard constraints while simultaneously optimizing soft user preferences. To support this, we build a realistic travel database covering transportation, accommodation, and ticketing for 20 U.S. National Parks, along with a comprehensive tool ecosystem that mirrors commercial booking platforms. Evaluating state-of-the-art models, we uncover two critical gaps: (i) an acceptable-optimal gap, where agents reliably meet constraints but fail to optimize preferences, and (ii) a plan-coordination gap, where performance collapses on multi-service (flight and hotel) coordination tasks, especially for open-source models. By grounding reasoning and planning in a practical, user-facing domain, COMPASS provides a benchmark that directly measures an agent's ability to optimize user preferences in realistic tasks, bridging theoretical advances with real-world impact.
Related papers
- RecNet: Self-Evolving Preference Propagation for Agentic Recommender Systems [109.9061591263748]
RecNet is a self-evolving preference propagation framework for recommender systems.<n>It proactively propagates real-time preference updates across related users and items.<n>In the backward phase, the feedback-driven propagation optimization mechanism simulates a multi-agent reinforcement learning framework.
arXiv Detail & Related papers (2026-01-29T12:14:31Z) - LAPPI: Interactive Optimization with LLM-Assisted Preference-Based Problem Instantiation [6.8772471411888425]
We introduce LAPPI (LLM-Assisted Preference-based Problem Instantiation), an interactive approach that uses large language models (LLMs) to support users in this instantiation process.<n>In a user study on trip planning, our method successfully captured user preferences and generated feasible plans that outperformed both conventional and prompt-engineering approaches.
arXiv Detail & Related papers (2025-12-16T06:43:38Z) - In-the-Flow Agentic System Optimization for Effective Planning and Tool Use [73.72524040856052]
AgentFlow is a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory.<n>Flow-GRPO tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates.<n>AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks.
arXiv Detail & Related papers (2025-10-07T05:32:44Z) - ATLAS: Constraints-Aware Multi-Agent Collaboration for Real-World Travel Planning [53.065247112514534]
ATLAS is a general multi-agent framework designed to handle complex nature of constraints awareness in real-world travel planning tasks.<n>We demonstrate state-of-the-art performance on the TravelPlanner benchmark, improving the final pass rate from 23.3% to 44.4% over its best alternative.
arXiv Detail & Related papers (2025-09-29T23:23:52Z) - Plan Your Travel and Travel with Your Plan: Wide-Horizon Planning and Evaluation via LLM [58.50687282180444]
Travel planning is a complex task requiring the integration of diverse real-world information and user preferences.<n>We formulate this as an $L3$ planning problem, emphasizing long context, long instruction, and long output.<n>We introduce Multiple Aspects of Planning (MAoP), enabling LLMs to conduct wide-horizon thinking to solve complex planning problems.
arXiv Detail & Related papers (2025-06-14T09:37:59Z) - Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents [16.295418365993033]
We introduce Flex-TravelPlanner, a benchmark that evaluates language models' ability to reason flexibly in dynamic planning scenarios.<n>Our analysis of GPT-4o and Llama 3.1 70B reveals several key findings.
arXiv Detail & Related papers (2025-06-05T05:31:50Z) - Vaiage: A Multi-Agent Solution to Personalized Travel Planning [0.27309692684728615]
Planning trips is a cognitively intensive task involving conflicting user preferences, dynamic external information, and multi-step temporal-spatial optimization.<n>Our approach, Vaiage, addresses these challenges through a graph-structured multi-agent framework built around large language models (LLMs) that serve as both goal-conditioned recommenders and sequential planners.<n>Through natural language interaction, structured tool use, and map-based feedback loops, Vaiage enables adaptive, explainable, and end-to-end travel planning grounded in both symbolic reasoning and conversational understanding.
arXiv Detail & Related papers (2025-05-16T06:54:52Z) - A Survey on the Optimization of Large Language Model-based Agents [16.733092886211097]
Large Language Models (LLMs) have been widely adopted in various fields, becoming essential for autonomous decision-making and interactive tasks.<n>However, current work typically relies on prompt design or fine-tuning strategies applied to vanilla LLMs.<n>We provide a comprehensive review of LLM-based agent optimization approaches, categorizing them into parameter-driven and parameter-free methods.
arXiv Detail & Related papers (2025-03-16T10:09:10Z) - World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning [60.100794160682646]
We propose a new learning framework that jointly optimize state prediction and action selection through preference learning.<n>To automatically collect trajectories and stepwise preference data without human annotation, we introduce a tree search mechanism for extensive exploration via trial-and-error.<n>Our method significantly outperforms existing methods and GPT-4o when applied to Qwen2-VL (7B), LLaVA-1.6 (7B), and LLaMA-3.2 (11B)
arXiv Detail & Related papers (2025-03-13T15:49:56Z) - Interactive Joint Planning for Autonomous Vehicles [19.479300967537675]
In interactive driving scenarios, the actions of one agent greatly influences those of its neighbors.
We present Interactive Joint Planning (IJP) that bridges MPC with learned prediction models.
IJP significantly outperforms the baselines that are either without joint optimization or running sampling-based planning.
arXiv Detail & Related papers (2023-10-27T17:48:25Z) - Optimal Cost-Preference Trade-off Planning with Multiple Temporal Tasks [3.655021726150368]
We introduce a novel notion of preference that provides a generalized framework to express preferences over individual tasks as well as their relations.
We perform an optimal trade-off (Pareto) analysis between behaviors that adhere to the user's preference and the ones that are resource optimal.
arXiv Detail & Related papers (2023-06-22T21:56:49Z) - Generating Useful Accident-Prone Driving Scenarios via a Learned Traffic
Prior [135.78858513845233]
STRIVE is a method to automatically generate challenging scenarios that cause a given planner to produce undesirable behavior, like collisions.
To maintain scenario plausibility, the key idea is to leverage a learned model of traffic motion in the form of a graph-based conditional VAE.
A subsequent optimization is used to find a "solution" to the scenario, ensuring it is useful to improve the given planner.
arXiv Detail & Related papers (2021-12-09T18:03:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.