Fugu-MT Paper Translation (Abstract): TravelBench: A Broader Real-World Benchmark for Multi-Turn and Tool-Using Travel Planning

Paper overview: TravelBench: A Broader Real-World Benchmark for Multi-Turn and Tool-Using Travel Planning

arxiv url: http://arxiv.org/abs/2512.22673v2
Date: Mon, 05 Jan 2026 13:19:13 GMT
Status: Translation complete
Last updated in system: 2026-01-06 14:31:43.628762
Title: TravelBench: A Broader Real-World Benchmark for Multi-Turn and Tool-Using Travel Planning
Title (reference translation): TravelBench: A broader real-world benchmark for multi-turn, tool-using travel planning
Authors: Xiang Cheng, Yulan Hu, Xiangwen Zhang, Lu Xu, Zheng Pan, Xin Li, Yong Liu
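Abstract summary: Travel planning is a natural, realistic task for testing the planning and tool-use abilities of large language models (LLMs). TravelBench is a benchmark for fully real-world travel planning.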
Reference score (independently computed attention score): 22.3041021610283
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Travel planning is a natural real-world task for testing the planning and tool-use abilities of large language models (LLMs). Although prior work has studied LLM performance on travel planning, existing settings still differ from real-world needs, mainly due to limited domain coverage, insufficient modeling of users' implicit preferences in multi-turn conversations, and a lack of clear evaluation of agents' capability boundaries. To mitigate these gaps, we propose TravelBench, a benchmark for fully real-world travel planning. We collect user queries, user profiles, and tools from real scenarios, and construct three subtasks (Single-Turn, Multi-Turn, and Unsolvable) to evaluate agents' three core capabilities in real settings: (1) solving problems autonomously, (2) interacting with users over multiple turns to refine requirements, and (3) recognizing the limits of their own abilities. To enable stable tool invocation and reproducible evaluation, we cache real tool-call results and build a sandbox environment that integrates ten travel-related tools. Agents can combine these tools to solve most practical travel planning problems, and our systematic verification demonstrates the stability of the proposed benchmark. We further evaluate multiple LLMs on TravelBench and conduct an in-depth analysis of their behaviors and performance. TravelBench provides a practical and reproducible evaluation benchmark to advance research on LLM agents for travel planning. Our code and data will be available after internal review.
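Abstract (reference translation): Travel planning is a natural, realistic task for testing the planning and tool-use abilities of large language models (LLMs). Prior work has studied LLM performance on travel planning, but existing settings still differ from real-world needs because of limited domain coverage, insufficient modeling of users' implicit preferences in multi-turn conversations, and the lack of a clear evaluation of agents' capability boundaries. To mitigate these gaps, we propose TravelBench, a benchmark for real-world travel planning. We collect user queries, user profiles, and tools from real scenarios and construct three subtasks (Single-Turn, Multi-Turn, and Unsolvable) to evaluate agents' three core capabilities. To enable stable tool invocation and reproducible evaluation, we cache real tool-call results and build a sandbox environment that integrates ten travel-related tools. By combining these tools, agents can solve most practical travel planning problems. We further evaluate multiple LLMs on TravelBench and analyze their behavior and performance in detail. TravelBench provides a practical and reproducible evaluation benchmark to advance research on LLM agents for travel planning. The code and data will be made available after internal review.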

Related papers