WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints
- URL: http://arxiv.org/abs/2602.08367v1
- Date: Mon, 09 Feb 2026 08:03:30 GMT
- Title: WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints
- Authors: Zexuan Wang, Chenghao Yang, Yingqi Que, Zhenzhu Yang, Huaqing Yuan, Yiwen Wang, Zhengxuan Jiang, Shengjie Fang, Zhenhe Wu, Zhaohui Wang, Zhixin Yao, Jiashuo Liu, Jincheng Ren, Yuzhen Li, Yang Yang, Jiaheng Liu, Jian Yang, Zaiyuan Wang, Ge Zhang, Zhoufutu Wen, Wenhao Huang,
- Abstract summary: Real-world autonomous planning requires coordinating tightly coupled constraints where a single decision dictates the feasibility of all subsequent actions.<n>We introduce textbfWorldTravel, a benchmark comprising 150 real-world travel scenarios across 5 cities that demand navigating an average of 15+ interdependent temporal and logical constraints.<n>To evaluate agents in realistic deployments, we develop textbfWorldTravel-Webscape, a multi-modal environment featuring over 2,000 rendered webpages.
- Score: 43.573740013433394
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world autonomous planning requires coordinating tightly coupled constraints where a single decision dictates the feasibility of all subsequent actions. However, existing benchmarks predominantly feature loosely coupled constraints solvable through local greedy decisions and rely on idealized data, failing to capture the complexity of extracting parameters from dynamic web environments. We introduce \textbf{WorldTravel}, a benchmark comprising 150 real-world travel scenarios across 5 cities that demand navigating an average of 15+ interdependent temporal and logical constraints. To evaluate agents in realistic deployments, we develop \textbf{WorldTravel-Webscape}, a multi-modal environment featuring over 2,000 rendered webpages where agents must perceive constraint parameters directly from visual layouts to inform their planning. Our evaluation of 10 frontier models reveals a significant performance collapse: even the state-of-the-art GPT-5.2 achieves only 32.67\% feasibility in text-only settings, which plummets to 19.33\% in multi-modal environments. We identify a critical Perception-Action Gap and a Planning Horizon threshold at approximately 10 constraints where model reasoning consistently fails, suggesting that perception and reasoning remain independent bottlenecks. These findings underscore the need for next-generation agents that unify high-fidelity visual perception with long-horizon reasoning to handle brittle real-world logistics.
Related papers
- Optimization-Guided Diffusion for Interactive Scene Generation [52.23368750264419]
We present OMEGA, an optimization-guided, training-free framework that enforces structural consistency and interaction awareness during diffusion-based sampling.<n>We show that OMEGA improves generation realism, consistency, and controllability, increasing the ratio of physically and behaviorally valid scenes.<n>Our approach can also generate $5times$ more near-collision frames with a time-to-collision under three seconds.
arXiv Detail & Related papers (2025-12-08T15:56:18Z) - Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities.<n>Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used Nuplan benchmark.<n>We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z) - ATLAS: Constraints-Aware Multi-Agent Collaboration for Real-World Travel Planning [53.065247112514534]
ATLAS is a general multi-agent framework designed to handle complex nature of constraints awareness in real-world travel planning tasks.<n>We demonstrate state-of-the-art performance on the TravelPlanner benchmark, improving the final pass rate from 23.3% to 44.4% over its best alternative.
arXiv Detail & Related papers (2025-09-29T23:23:52Z) - UnLoc: Leveraging Depth Uncertainties for Floorplan Localization [80.55849461031879]
UnLoc is an efficient data-driven solution for sequential camera localization within floorplans.<n>We introduce a novel probabilistic model that incorporates uncertainty estimation, modeling depth predictions as explicit probability distributions.<n>We evaluate UnLoc on large-scale synthetic and real-world datasets, demonstrating significant improvements in terms of accuracy and robustness.
arXiv Detail & Related papers (2025-09-14T14:45:43Z) - RETAIL: Towards Real-world Travel Planning for Large Language Models [36.75531019697594]
We present a novel dataset textbfRETAIL, which supports decision-making for implicit queries while covering explicit queries.<n>It also enables environmental awareness to ensure plan feasibility under real-world scenarios, while incorporating detailed POI information for all-in-one travel plans.<n>Our experiments reveal that even the strongest existing model achieves merely a 1.0% pass rate, indicating real-world travel planning remains extremely challenging.
arXiv Detail & Related papers (2025-08-21T08:08:38Z) - TripTailor: A Real-World Benchmark for Personalized Travel Planning [28.965273870656446]
TripTailor is a benchmark for personalized travel planning in real-world scenarios.<n>This dataset features over 500,000 real-world points of interest (POIs) and nearly 4,000 diverse travel itineraries.<n>We identify several critical challenges in travel planning, including the feasibility, rationality, and personalized customization.
arXiv Detail & Related papers (2025-08-02T16:44:02Z) - Foundation Models for Logistics: Toward Certifiable, Conversational Planning Interfaces [59.80143393787701]
Large language models (LLMs) can handle uncertainty and promise to accelerate replanning while lowering the barrier to entry.<n>We introduce a neurosymbolic framework that pairs the accessibility of natural-language dialogue with verifiable guarantees on goal interpretation.<n>A lightweight model, fine-tuned on just 100 uncertainty-filtered examples, surpasses the zero-shot performance of GPT-4.1 while cutting inference latency by nearly 50%.
arXiv Detail & Related papers (2025-07-15T14:24:01Z) - ChinaTravel: An Open-Ended Benchmark for Language Agents in Chinese Travel Planning [38.44879526364259]
We introduce emphChinaTravel, the first open-ended benchmark grounded in authentic Chinese travel requirements.<n>We design a compositionally generalizable domain-specific language for scalable evaluation, covering feasibility, constraint satisfaction, and preference comparison.<n> Empirical studies reveal the potential of neuro-symbolic agents in travel planning, achieving a 37.0% constraint satisfaction rate on human queries.
arXiv Detail & Related papers (2024-12-18T10:10:12Z) - Large Language Models Can Solve Real-World Planning Rigorously with Formal Verification Tools [12.875270710153021]
Large Language Models (LLMs) struggle to directly generate correct plans for complex multi-constraint planning problems.<n>We propose an LLM-based planning framework that formalizes and solves complex multi-constraint planning problems.<n>Our framework achieves a success rate of 93.9% and is effective with diverse paraphrased prompts.
arXiv Detail & Related papers (2024-04-18T04:36:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.