Related papers: ChinaTravel: An Open-Ended Benchmark for Language Agents in Chinese Travel Planning

URL: http://arxiv.org/abs/2412.13682v3
Date: Fri, 30 May 2025 13:35:50 GMT
Title: ChinaTravel: An Open-Ended Benchmark for Language Agents in Chinese Travel Planning
Authors: Jie-Jing Shao, Bo-Wen Zhang, Xiao-Wen Yang, Baizhi Chen, Si-Yu Han, Wen-Da Wei, Guohao Cai, Zhenhua Dong, Lan-Zhe Guo, Yu-feng Li
Abstract summary: We introduce ChinaTravel, the first open-ended benchmark grounded in authentic Chinese travel requirements. We design a compositionally generalizable domain-specific language for scalable evaluation, covering feasibility, constraint satisfaction, and preference comparison. Empirical studies reveal the potential of neuro-symbolic agents in travel planning, achieving a 37.0% constraint satisfaction rate on human queries.
Score: 49.37899519520761
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in LLMs, particularly in language reasoning and tool integration, have rapidly sparked the Language Agents for real-world development. Among these, travel planning represents a prominent domain, combining complex multi-objective planning challenges with practical deployment demands. However, existing benchmarks often oversimplify real-world requirements by focusing on synthetic queries and limited constraints. We address the gap of evaluating language agents in multi-day, multi-POI travel planning scenarios with diverse and open human needs. Specifically, we introduce ChinaTravel, the first open-ended benchmark grounded in authentic Chinese travel requirements collected from 1,154 human participants. We design a compositionally generalizable domain-specific language (DSL) for scalable evaluation, covering feasibility, constraint satisfaction, and preference comparison. Empirical studies reveal the potential of neuro-symbolic agents in travel planning, achieving a 37.0% constraint satisfaction rate on human queries, a 10× improvement over purely neural models. These findings highlight ChinaTravel as a pivotal milestone for advancing language agents in complex, real-world planning scenarios.

Related papers

TripTailor: A Real-World Benchmark for Personalized Travel Planning [28.965273870656446]
TripTailor is a benchmark for personalized travel planning in real-world scenarios. This dataset features over 500,000 real-world points of interest (POIs) and nearly 4,000 diverse travel itineraries. We identify several critical challenges in travel planning, including feasibility, rationality, and personalized customization.
arXiv Detail & Related papers (2025-08-02T16:44:02Z)
Foundation Models for Logistics: Toward Certifiable, Conversational Planning Interfaces [59.80143393787701]
Large language models (LLMs) can handle uncertainty and promise to accelerate replanning while lowering the barrier to entry. We introduce a neurosymbolic framework that pairs the accessibility of natural-language dialogue with verifiable guarantees on goal interpretation. A lightweight model, fine-tuned on just 100 uncertainty-filtered examples, surpasses the zero-shot performance of GPT-4.1 while cutting inference latency by nearly 50%.
arXiv Detail & Related papers (2025-07-15T14:24:01Z)
Plan Your Travel and Travel with Your Plan: Wide-Horizon Planning and Evaluation via LLM [58.50687282180444]
Travel planning is a complex task requiring the integration of diverse real-world information and user preferences. We formulate this as an $L^3$ planning problem, emphasizing long context, long instruction, and long output. We introduce Multiple Aspects of Planning (MAoP), enabling LLMs to conduct wide-horizon thinking to solve complex planning problems.
arXiv Detail & Related papers (2025-06-14T09:37:59Z)
TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning [39.934634038758404]
This paper introduces TP-RAG, the first benchmark tailored for retrieval-augmented, spatiotemporal-aware travel planning. Our dataset includes 2,348 real-world travel queries, 85,575 fine-grained POIs, and 18,784 annotated POIs.
arXiv Detail & Related papers (2025-04-11T17:02:40Z)
TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel Planning [7.841787597078323]
TripCraft establishes a new benchmark for LLM-driven personalized travel planning, offering a more realistic, constraint-aware framework for itinerary generation. Our parameter-informed setting significantly enhances meal scheduling, improving the Temporal Meal Score from 61% to 80% in a 7-day scenario.
arXiv Detail & Related papers (2025-02-27T20:33:28Z)
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios [53.26658545922884]
We introduce EgoPlan-Bench2, a benchmark designed to assess the planning capabilities of MLLMs across a wide range of real-world scenarios. We evaluate 21 competitive MLLMs and provide an in-depth analysis of their limitations, revealing that they face significant challenges in real-world planning. Our approach enhances the performance of GPT-4V by 10.24 on EgoPlan-Bench2 without additional training.
arXiv Detail & Related papers (2024-12-05T18:57:23Z)
To the Globe (TTG): Towards Language-Driven Guaranteed Travel Planning [54.9340658451129]
To the Globe (TTG) is a real-time demo system that takes natural language requests from users and translates them into symbolic form. The overall system takes 5 seconds to reply to the user request with guaranteed itineraries. When evaluated by users, TTG achieves consistently high Net Promoter Scores (NPS) of 35-40% on generated itineraries.
arXiv Detail & Related papers (2024-10-21T19:30:05Z)
LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments [70.91258869156353]
We introduce LangSuitE, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds. Compared with previous LLM-based testbeds, LangSuitE offers adaptability to diverse environments without multiple simulation engines. We devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states w.r.t. history information.
arXiv Detail & Related papers (2024-06-24T03:36:29Z)
Ask-before-Plan: Proactive Language Agents for Real-World Planning [68.08024918064503]
Proactive Agent Planning requires language agents to predict clarification needs based on user-agent conversation and agent-environment interaction. We propose a novel multi-agent framework, Clarification-Execution-Planning (CEP), which consists of three agents specialized in clarification, execution, and planning.
arXiv Detail & Related papers (2024-06-18T14:07:28Z)
Large Language Models Can Solve Real-World Planning Rigorously with Formal Verification Tools [12.875270710153021]
Large Language Models (LLMs) still struggle to directly generate correct plans for complex multi-constraint planning problems. We propose an LLM-based planning framework that formalizes and solves complex multi-constraint planning problems as constrained satisfiability problems. We show that our framework can modify and solve for an average of 81.6% and 91.7% of unsatisfiable queries from two datasets.
arXiv Detail & Related papers (2024-04-18T04:36:37Z)
TravelPlanner: A Benchmark for Real-World Planning with Language Agents [63.199454024966506]
We propose TravelPlanner, a new planning benchmark that focuses on travel planning, a common real-world planning scenario. It provides a rich sandbox environment, various tools for accessing nearly four million data records, and 1,225 meticulously curated planning intents and reference plans. Comprehensive evaluations show that current language agents are not yet capable of handling such complex planning tasks: even GPT-4 achieves a success rate of only 0.6%.
arXiv Detail & Related papers (2024-02-02T18:39:51Z)
DREAMWALKER: Mental Planning for Continuous Vision-Language Navigation [107.5934592892763]
We propose DREAMWALKER, a world-model-based VLN-CE agent. The world model is built to summarize the visual, topological, and dynamic properties of the complicated continuous environment. It can simulate and evaluate possible plans entirely in such an internal abstract world, before executing costly actions.
arXiv Detail & Related papers (2023-08-14T23:45:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.