Fugu-MT Paper Translation (Abstract): Benchmarking LLM Tool-Use in the Wild

Paper overview: Benchmarking LLM Tool-Use in the Wild

arxiv url: http://arxiv.org/abs/2604.06185v1
Date: Fri, 13 Feb 2026 08:55:46 GMT
Status: Translation complete
Last updated in system: 2026-04-19 19:09:11.382589
Title: Benchmarking LLM Tool-Use in the Wild
Authors: Peijie Yu, Wei Liu, Yifan Yang, Jinjian Li, Zelong Zhang, Xiao Feng, Feng Zhang,
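Abstract summary: Real user interactions are inherently wild: intricate, messy, and flexible. We identify three key challenges arising from user behavior: compositional tasks that demand efficient orchestration of tool-call topologies, implicit intent spread across dialogue turns, and instruction transition. WildToolBench is an LLM tool-use benchmark grounded in real-world user behavior patterns.
Reference score (independently computed attention level): 10.664145474355445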
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Fulfilling user needs through Large Language Model multi-turn, multi-step tool-use is rarely a straightforward process. Real user interactions are inherently wild, being intricate, messy, and flexible. We identify three key challenges from user behaviour: compositional tasks that demand efficient orchestration of tool-call topologies, implicit intent spread across dialogue turns that require contextual inference, and instruction transition, which mixes task queries, clarifications, and casual conversation, forcing LLMs to adjust their policies on the fly. Existing benchmarks overlook these behaviors, making the apparent progress of LLMs on tool-use spurious. To address this, we introduce WildToolBench, an LLM tool-use benchmark grounded in real-world user behavior patterns. Comprehensive evaluations of 57 LLMs reveal that no model achieves an accuracy of more than 15%, indicating a substantial gap in the robustness of LLMs' agentic ability. Controlled experiments and in-depth analyses further indicate that the real challenge for LLM tool-use lies not in artificially complex tasks, but in the wild nature of user behavior, emphasizing the need to reconsider the interactions among LLMs, users, and tools.

Related papers