ChinaTravel: A Real-World Benchmark for Language Agents in Chinese Travel Planning
- URL: http://arxiv.org/abs/2412.13682v2
- Date: Fri, 20 Dec 2024 15:08:25 GMT
- Title: ChinaTravel: A Real-World Benchmark for Language Agents in Chinese Travel Planning
- Authors: Jie-Jing Shao, Xiao-Wen Yang, Bo-Wen Zhang, Baizhi Chen, Wen-Da Wei, Guohao Cai, Zhenhua Dong, Lan-Zhe Guo, Yu-feng Li
- Abstract summary: We introduce ChinaTravel, a benchmark specifically designed for authentic Chinese travel planning scenarios. We collect the travel requirements from questionnaires and propose a compositionally generalizable domain-specific language. Empirical studies reveal the potential of neuro-symbolic agents in travel planning, achieving a constraint satisfaction rate of 27.9%. We identify key challenges in real-world travel planning deployments, including open language reasoning and unseen concept composition.
- Score: 50.7898120693695
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in LLMs, particularly in language reasoning and tool integration, have rapidly sparked the real-world development of Language Agents. Among these, travel planning represents a prominent domain, combining academic challenges with practical value due to its complexity and market demand. However, existing benchmarks fail to reflect the diverse, real-world requirements crucial for deployment. To address this gap, we introduce ChinaTravel, a benchmark specifically designed for authentic Chinese travel planning scenarios. We collect the travel requirements from questionnaires and propose a compositionally generalizable domain-specific language that enables a scalable evaluation process, covering feasibility, constraint satisfaction, and preference comparison. Empirical studies reveal the potential of neuro-symbolic agents in travel planning, achieving a constraint satisfaction rate of 27.9%, significantly surpassing purely neural models at 2.6%. Moreover, we identify key challenges in real-world travel planning deployments, including open language reasoning and unseen concept composition. These findings highlight the significance of ChinaTravel as a pivotal milestone for advancing language agents in complex, real-world planning scenarios.
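As a rough illustration of what a constraint satisfaction rate measures, the following minimal sketch treats each plan as a dictionary and each constraint as a predicate, then reports the fraction of queries whose plan passes every check. It is not the paper's domain-specific language or its exact metric definition; the plan fields (`total_cost`, `num_attractions`), the two predicates, and the function name are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical representation: a plan is a flat dict of numeric attributes,
# and a constraint is any predicate over such a plan.
Plan = Dict[str, float]
Constraint = Callable[[Plan], bool]


def constraint_satisfaction_rate(
    plans: List[Plan], constraints_per_query: List[List[Constraint]]
) -> float:
    """Return the fraction of queries whose plan satisfies all of its constraints."""
    if not plans:
        return 0.0
    passed = sum(
        all(check(plan) for check in checks)
        for plan, checks in zip(plans, constraints_per_query)
    )
    return passed / len(plans)


# Two made-up plans, each evaluated against the same two made-up constraints.
plans = [
    {"total_cost": 900.0, "num_attractions": 3},
    {"total_cost": 1500.0, "num_attractions": 1},
]


def within_budget(plan: Plan) -> bool:
    return plan["total_cost"] <= 1000.0


def enough_sights(plan: Plan) -> bool:
    return plan["num_attractions"] >= 2


constraints = [[within_budget, enough_sights], [within_budget, enough_sights]]
rate = constraint_satisfaction_rate(plans, constraints)
print(rate)  # 0.5: only the first plan passes both checks
```

Counting a query as satisfied only when every constraint holds is one possible convention; a benchmark may instead average over individual constraints.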
Related papers
- TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning [39.934634038758404]
This paper introduces TP-RAG, the first benchmark tailored for retrieval-augmented, spatiotemporal-aware travel planning.
Our dataset includes 2,348 real-world travel queries, 85,575 fine-grained POIs, and 18,784 annotated POIs.
arXiv Detail & Related papers (2025-04-11T17:02:40Z)
- TripCraft: A Benchmark for Spatio-Temporally Fine Grained Travel Planning [7.841787597078323]
TripCraft establishes a new benchmark for LLM-driven personalized travel planning, offering a more realistic, constraint-aware framework for itinerary generation.
Our parameter-informed setting significantly enhances meal scheduling, improving the Temporal Meal Score from 61% to 80% in a 7-day scenario.
arXiv Detail & Related papers (2025-02-27T20:33:28Z)
- EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios [53.26658545922884]
We introduce EgoPlan-Bench2, a benchmark designed to assess the planning capabilities of MLLMs across a wide range of real-world scenarios.
We evaluate 21 competitive MLLMs and provide an in-depth analysis of their limitations, revealing that they face significant challenges in real-world planning.
Our approach enhances the performance of GPT-4V by 10.24 on EgoPlan-Bench2 without additional training.
arXiv Detail & Related papers (2024-12-05T18:57:23Z)
- To the Globe (TTG): Towards Language-Driven Guaranteed Travel Planning [54.9340658451129]
To the Globe (TTG) is a real-time demo system that takes natural language requests from users and translates them to symbolic form.
The overall system takes 5 seconds to reply to the user request with guaranteed itineraries.
When evaluated by users, TTG achieves consistently high Net Promoter Scores (NPS) of 35-40% on generated itineraries.
arXiv Detail & Related papers (2024-10-21T19:30:05Z)
- Ask-before-Plan: Proactive Language Agents for Real-World Planning [68.08024918064503]
Proactive Agent Planning requires language agents to predict clarification needs based on user-agent conversation and agent-environment interaction.
We propose a novel multi-agent framework, Clarification-Execution-Planning (CEP), which consists of three agents specialized in clarification, execution, and planning.
arXiv Detail & Related papers (2024-06-18T14:07:28Z)
- Large Language Models Can Solve Real-World Planning Rigorously with Formal Verification Tools [12.875270710153021]
Large Language Models (LLMs) still struggle to directly generate correct plans for complex multi-constraint planning problems.
We propose an LLM-based planning framework that formalizes and solves complex multi-constraint planning problems as constrained satisfiability problems.
We show that our framework can modify and solve an average of 81.6% and 91.7% of the unsatisfiable queries from two datasets.
arXiv Detail & Related papers (2024-04-18T04:36:37Z)
- TravelPlanner: A Benchmark for Real-World Planning with Language Agents [63.199454024966506]
We propose TravelPlanner, a new planning benchmark that focuses on travel planning, a common real-world planning scenario.
It provides a rich sandbox environment, various tools for accessing nearly four million data records, and 1,225 meticulously curated planning intents and reference plans.
Comprehensive evaluations show that current language agents are not yet capable of handling such complex planning tasks: even GPT-4 achieves a success rate of only 0.6%.
arXiv Detail & Related papers (2024-02-02T18:39:51Z)
- DREAMWALKER: Mental Planning for Continuous Vision-Language Navigation [107.5934592892763]
We propose DREAMWALKER, a world-model-based VLN-CE agent.
The world model is built to summarize the visual, topological, and dynamic properties of the complicated continuous environment.
It can simulate and evaluate possible plans entirely in this internal abstract world before executing costly actions.
arXiv Detail & Related papers (2023-08-14T23:45:01Z)