TravelBench: A Broader Real-World Benchmark for Multi-Turn and Tool-Using Travel Planning
- URL: http://arxiv.org/abs/2512.22673v2
- Date: Mon, 05 Jan 2026 13:19:13 GMT
- Title: TravelBench: A Broader Real-World Benchmark for Multi-Turn and Tool-Using Travel Planning
- Authors: Xiang Cheng, Yulan Hu, Xiangwen Zhang, Lu Xu, Zheng Pan, Xin Li, Yong Liu
- Abstract summary: Travel planning is a natural real-world task for testing the planning and tool-use abilities of large language models (LLMs). TravelBench is a benchmark for fully real-world travel planning.
- Score: 22.3041021610283
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Travel planning is a natural real-world task for testing the planning and tool-use abilities of large language models (LLMs). Although prior work has studied LLM performance on travel planning, existing settings still differ from real-world needs, mainly due to limited domain coverage, insufficient modeling of users' implicit preferences in multi-turn conversations, and a lack of clear evaluation of agents' capability boundaries. To mitigate these gaps, we propose \textbf{TravelBench}, a benchmark for fully real-world travel planning. We collect user queries, user profiles, and tools from real scenarios, and construct three subtasks (Single-Turn, Multi-Turn, and Unsolvable) to evaluate three core agent capabilities in real settings: (1) solving problems autonomously, (2) interacting with users over multiple turns to refine requirements, and (3) recognizing the limits of their own abilities. To enable stable tool invocation and reproducible evaluation, we cache real tool-call results and build a sandbox environment that integrates ten travel-related tools. Agents can combine these tools to solve most practical travel planning problems, and our systematic verification demonstrates the stability of the proposed benchmark. We further evaluate multiple LLMs on TravelBench and conduct an in-depth analysis of their behaviors and performance. TravelBench provides a practical and reproducible evaluation benchmark to advance research on LLM agents for travel planning.\footnote{Our code and data will be available after internal review.}
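The cached tool-call sandbox described in the abstract (freezing real API responses so that every evaluation run replays identical tool outputs) can be illustrated with a short sketch. This is a hypothetical illustration under assumed names (CachedToolSandbox, the flight_search tool, the hash-based key scheme), not TravelBench's released code.

```python
import hashlib
import json


class CachedToolSandbox:
    """Replays recorded tool-call results so evaluation runs are deterministic."""

    def __init__(self) -> None:
        self._cache: dict[str, dict] = {}

    @staticmethod
    def _key(tool_name: str, args: dict) -> str:
        # Canonicalize arguments so logically equal calls share one key.
        payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def record(self, tool_name: str, args: dict, response: dict) -> None:
        # Store a real API response captured at data-collection time.
        self._cache[self._key(tool_name, args)] = response

    def call(self, tool_name: str, args: dict) -> dict:
        # A frozen benchmark has no live fallback: an unseen call is an
        # explicit error rather than a fresh (non-reproducible) API hit.
        key = self._key(tool_name, args)
        if key not in self._cache:
            raise KeyError(f"uncached tool call: {tool_name}({args})")
        return self._cache[key]


# Illustrative usage: record once at collection time, replay at eval time.
sandbox = CachedToolSandbox()
sandbox.record("flight_search", {"from": "PEK", "to": "SHA"}, {"flights": ["CA1501"]})
print(sandbox.call("flight_search", {"from": "PEK", "to": "SHA"}))
```

Keying on the canonicalized (tool, arguments) pair is one way to make replay exact while staying insensitive to argument ordering; unseen calls fail loudly, which also makes the benchmark's capability boundary explicit.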
Related papers
- MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios [34.570930885283694]
We introduce MobilityBench, a benchmark for evaluating large language model (LLM)-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap. We propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency.
arXiv Detail & Related papers (2026-02-26T05:39:38Z) - Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents [52.30603055218294]
Trajectory2Task is a verifiable data generation pipeline for studying tool use at scale under three realistic user scenarios. It converts valid tool-call trajectories into user-facing tasks with controlled intent adaptations. We benchmark seven state-of-the-art LLMs on the generated complex user scenario tasks and observe frequent failures.
arXiv Detail & Related papers (2026-01-28T00:36:13Z) - TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use [74.47746287181383]
Large language model (LLM)-based agents increasingly rely on tool use to complete real-world tasks. We introduce TRAJECT-Bench, a trajectory-aware benchmark to comprehensively evaluate LLMs' tool use capability.
arXiv Detail & Related papers (2025-10-06T07:30:25Z) - VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications [20.065087936770215]
We introduce VitaBench, a benchmark that evaluates agents on versatile interactive tasks grounded in real-world settings. VitaBench presents agents with the most complex life-serving simulation environment to date, comprising 66 tools. Our comprehensive evaluation reveals that even the most advanced models achieve only a 30% success rate on cross-scenario tasks.
arXiv Detail & Related papers (2025-09-30T16:33:49Z) - DeepTravel: An End-to-End Agentic Reinforcement Learning Framework for Autonomous Travel Planning Agents [26.786926580388325]
Travel planning (TP) agents have recently emerged as building blocks that interact with external tools and resources for travel itinerary generation. This paper proposes DeepTravel, an end-to-end agentic reinforcement learning framework for building autonomous travel planning agents.
arXiv Detail & Related papers (2025-09-26T04:03:52Z) - ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks [64.86209459039313]
ThinkGeo is an agentic benchmark designed to evaluate tool-augmented agents on remote sensing tasks via structured tool use and multi-step planning. We implement a ReAct-style interaction loop (a minimal sketch of such a loop appears after this list) and evaluate both open- and closed-source LLMs on 486 structured agentic tasks with 1,773 expert-verified reasoning steps. Our analysis reveals notable disparities in tool accuracy and planning consistency across models.
arXiv Detail & Related papers (2025-05-29T17:59:38Z) - FamilyTool: A Multi-hop Personalized Tool Use Benchmark [93.80355496575281]
FamilyTool is a benchmark grounded in a family-based knowledge graph (KG) that simulates personalized, multi-hop tool use scenarios. Experiments reveal significant performance gaps in state-of-the-art large language models (LLMs). FamilyTool serves as a critical resource for evaluating and advancing LLM agents' reasoning, adaptability, and scalability in complex, dynamic environments.
arXiv Detail & Related papers (2025-04-09T10:42:36Z) - Multi-Mission Tool Bench: Assessing the Robustness of LLM based Agents through Related and Dynamic Missions [12.218102495632937]
Large language models (LLMs) demonstrate strong potential as agents for tool invocation due to their advanced comprehension and planning capabilities. We propose the Multi-Mission Tool Bench, in which each test case comprises multiple interrelated missions. We also propose a novel method to evaluate the accuracy and efficiency of agent decisions with dynamic decision trees.
arXiv Detail & Related papers (2025-04-03T14:21:33Z) - ACEBench: Who Wins the Match Point in Tool Usage? [86.79310356779108]
ACEBench is a comprehensive benchmark for assessing tool usage in large language models (LLMs). It categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. It provides a more granular examination of error causes across different data types.
arXiv Detail & Related papers (2025-01-22T12:59:08Z) - EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios [53.26658545922884]
We introduce EgoPlan-Bench2, a benchmark designed to assess the planning capabilities of multimodal large language models (MLLMs) across a wide range of real-world scenarios. We evaluate 21 competitive MLLMs and provide an in-depth analysis of their limitations, revealing that they face significant challenges in real-world planning. Our approach enhances the performance of GPT-4V by 10.24 on EgoPlan-Bench2 without additional training.
arXiv Detail & Related papers (2024-12-05T18:57:23Z) - GTA: A Benchmark for General Tool Agents [32.443456248222695]
We design 229 real-world tasks and executable tool chains to evaluate mainstream large language models (LLMs).
Our findings show that real-world user queries are challenging for existing LLMs, with GPT-4 completing less than 50% of the tasks and most LLMs achieving below 25%.
This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios and points to future directions for advancing general-purpose tool agents.
arXiv Detail & Related papers (2024-07-11T17:50:09Z) - Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios [93.68764280953624]
UltraTool is a novel benchmark designed to improve and evaluate large language models' ability to utilize tools.
It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving.
A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage.
arXiv Detail & Related papers (2024-01-30T16:52:56Z)
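Several of the benchmarks above (ThinkGeo explicitly) drive their agents with a ReAct-style interaction loop, alternating model reasoning, tool calls, and observations. The following is a minimal, hypothetical sketch of such a loop, not any benchmark's actual harness; the prompt format, tool registry, and stopping convention are assumptions for illustration.

```python
from typing import Callable

# Hypothetical tool registry: tool names mapped to plain Python callables.
TOOLS: dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}


def react_loop(task: str, llm: Callable[[str], str], max_steps: int = 5) -> str:
    """Thought -> Action -> Observation, repeated until a final answer.

    `llm` is any prompt -> reply function whose replies follow the assumed
    format 'Thought: ...\\nAction: tool[input]' or 'Final Answer: ...'.
    """
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        if "Action:" in reply:
            # Parse an action of the form: Action: calculator[2 + 2]
            action = reply.split("Action:", 1)[1].strip()
            name, arg = action.split("[", 1)
            observation = TOOLS[name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return "stopped: step budget exhausted"


# Usage with a scripted stand-in for the model:
script = iter([
    "Thought: I should compute it.\nAction: calculator[2 + 2]",
    "Final Answer: 4",
])
print(react_loop("What is 2 + 2?", llm=lambda prompt: next(script)))  # -> 4
```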