Fugu-MT Paper Translation (Abstract): TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents

Paper overview: TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents

arxiv url: http://arxiv.org/abs/2606.01046v1
Date: Sun, 31 May 2026 06:29:19 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:29.167212
Title: TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents
Title (reference translation): TravelEval: A Comprehensive Benchmark Framework for Evaluating LLM-Based Travel Planning Agents
Authors: Weiyi Chen, Shuaixiong Wang, Ziyun Gao, Kaichun Hu, Wangze Ni, Shimin Di, Chen Jason Zhang, Lei Chen
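Abstract summary: This work introduces TravelEval, a realistic and comprehensive benchmark for evaluating LLM-powered travel planning. TravelEval features 1) a novel six-dimensional evaluation framework that assesses plans across accuracy, compliance, temporality, spatiality, economy, and utility.
Reference score (independently computed attention metric): 16.732203115366584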
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The development of Large Language Models (LLMs) has significantly improved travel planning applications, yet evaluating such models is hampered by the limitations of existing benchmarks: 1) overemphasis on constraint compliance, neglecting multi-dimensional qualities like spatio-temporal cost; 2) datasets lacking real-world authenticity and coverage in key areas (e.g., lodging, transport); and 3) isolated daily plan assessments that miss critical details (e.g., the impact of daily accommodation and visit pacing) needed to evaluate an entire plan. To address this gap, we introduce TravelEval, a realistic and comprehensive benchmark. TravelEval features 1) a novel six-dimensional evaluation framework to holistically assess plans across accuracy, compliance, temporality, spatiality, economy, and utility dimensions; 2) a highly realistic data sandbox with precise accommodation pricing and authentic intercity transportation data; and 3) a simulation-based global evaluation method that emulates complete travel plans with API-integrated geographic information and fine-grained queuing time. Evaluating 12 mainstream approaches with TravelEval reveals several valuable insights, such as that LLMs struggle with globally optimized multi-dimensional planning (especially in spatio-temporal reasoning and budget compliance), and that agentic reasoning strategies offer no consistent improvement. In short, TravelEval facilitates travel plan evaluation via grounded spatio-temporal emulation and comprehensive metrics, providing a robust foundation for advancing LLM-powered travel planning research and applications.
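Abstract (reference translation): The development of large language models (LLMs) has substantially improved travel planning applications, but the evaluation of such models is constrained by the shortcomings of existing benchmarks: 1) an overemphasis on constraint compliance that neglects multi-dimensional qualities such as spatio-temporal cost; 2) datasets that lack real-world authenticity and coverage in key areas (e.g., lodging, transport); and 3) isolated daily plan assessments that miss critical details (e.g., the impact of daily accommodation and visit pacing) needed to evaluate an entire plan. To address this gap, we introduce TravelEval, a realistic and comprehensive benchmark. TravelEval features 1) a novel six-dimensional evaluation framework that assesses plans across accuracy, compliance, temporality, spatiality, economy, and utility; 2) a highly realistic data sandbox with precise accommodation pricing and authentic intercity transportation data; and 3) a simulation-based global evaluation method that emulates complete travel plans using API-integrated geographic information and fine-grained queuing times. Evaluating 12 mainstream approaches with TravelEval yields several valuable insights, including that LLMs struggle with globally optimized multi-dimensional planning (especially spatio-temporal reasoning and budget compliance) and that agentic reasoning strategies offer no consistent improvement. In short, TravelEval facilitates travel plan evaluation through grounded spatio-temporal emulation and comprehensive metrics, providing a robust foundation for advancing LLM-powered travel planning research and applications.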


Related papers