Fugu-MT Paper Translation (Abstract): TravelBench: A Real-World Benchmark for Multi-Turn and Tool-Augmented Travel Planning

Paper summary: TravelBench: A Real-World Benchmark for Multi-Turn and Tool-Augmented Travel Planning

arxiv url: http://arxiv.org/abs/2512.22673v1
Date: Sat, 27 Dec 2025 18:25:14 GMT
Status: Translation complete
Last updated in system: 2025-12-30 22:37:30.172332
Title: TravelBench: A Real-World Benchmark for Multi-Turn and Tool-Augmented Travel Planning
Title (reference translation): TravelBench: A Real-World Benchmark for Multi-Turn and Tool-Augmented Travel Planning
Authors: Xiang Cheng, Yulan Hu, Xiangwen Zhang, Lu Xu, Zheng Pan, Xin Li, Yong Liu
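Abstract summary: Large language model (LLM) agents have demonstrated strong capabilities in planning and tool use. Travel planning provides a natural and high-impact testbed for these capabilities. This paper introduces TravelBench, a real-world travel-planning benchmark featuring multi-turn interaction and tool use.
Reference score (independently computed attention score): 22.3041021610283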
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language model (LLM) agents have demonstrated strong capabilities in planning and tool use. Travel planning provides a natural and high-impact testbed for these capabilities, as it requires multi-step reasoning, iterative preference elicitation through interaction, and calls to external tools under evolving constraints. Prior work has studied LLMs on travel-planning tasks, but existing settings are limited in domain coverage and multi-turn interaction. As a result, they cannot support dynamic user-agent interaction and therefore fail to comprehensively assess agent capabilities. In this paper, we introduce TravelBench, a real-world travel-planning benchmark featuring multi-turn interaction and tool use. We collect user requests from real-world scenarios and construct three subsets (multi-turn, single-turn, and unsolvable) to evaluate different aspects of agent performance. For stable and reproducible evaluation, we build a controlled sandbox environment with 10 travel-domain tools, providing deterministic tool outputs for reliable reasoning. We evaluate multiple LLMs on TravelBench and conduct an analysis of their behaviors and performance. TravelBench offers a practical and reproducible benchmark for advancing LLM agents in travel planning.
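Abstract (reference translation): Large language model (LLM) agents have demonstrated strong capabilities in planning and tool use. Travel planning provides a natural, high-impact testbed for these capabilities: it requires multi-step reasoning, iterative preference elicitation through interaction, and calls to external tools under evolving constraints. Prior work has studied LLMs on travel-planning tasks, but existing settings are limited in domain coverage and multi-turn interaction. As a result, they cannot support dynamic user-agent interaction and so cannot comprehensively assess agent capabilities. This paper introduces TravelBench, a real-world travel-planning benchmark featuring multi-turn interaction and tool use. We collect user requests from real-world scenarios and construct three subsets (multi-turn, single-turn, and unsolvable) to evaluate different aspects of agent performance. For stable and reproducible evaluation, we build a controlled sandbox environment with 10 travel-domain tools that provides deterministic tool outputs for reliable reasoning. We evaluate multiple LLMs on TravelBench and analyze their behavior and performance. TravelBench offers a practical and reproducible benchmark for advancing LLM agents in travel planning.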

