Fugu-MT paper translation (abstract): Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling

Paper overview: Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling

arxiv url: http://arxiv.org/abs/2509.26553v1
Date: Tue, 30 Sep 2025 17:21:17 GMT
Status: Translation complete
Last updated in system: 2025-10-01 14:45:00.227485
Title: Towards Reliable Benchmarking: A Contamination Free, Controllable Evaluation Framework for Multi-step LLM Function Calling
Title (reference translation): Towards Reliable Benchmarking: A Contamination-Free, Controllable Evaluation Framework for Multi-step LLM Function Calling
Authors: Seiji Maekawa, Jackson Hassell, Pouya Pezeshkpour, Tom Mitchell, Estevam Hruschka
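Abstract summary: We propose FuncBenchGen, a contamination-free framework that evaluates tool-augmented language models (TaLMs) on synthetic multi-step tasks. Reasoning-optimized models consistently outperform general-purpose models, and GPT-5 significantly outperforms the other models. Strong models often make syntactically valid function calls but propagate incorrect or stale argument values across steps, revealing brittle state tracking by LLMs in multi-turn tool use.
Reference score (independently computed attention score): 16.396204092947496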
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As language models gain access to external tools via structured function calls, they become increasingly capable of solving complex, multi-step tasks. However, existing benchmarks for tool-augmented language models (TaLMs) provide insufficient control over factors such as the number of functions accessible, task complexity, and input size, and remain vulnerable to data contamination. We present FuncBenchGen, a unified, contamination-free framework that evaluates TaLMs by generating synthetic multi-step tool-use tasks. The key idea is to cast tool use as traversal over a hidden function-dependency DAG where nodes are function calls and an edge between nodes represents one function consuming the output of another. Given a set of external function schemas, initial variable values, and a target variable, models must compose the correct call sequence to compute the target variable. FuncBenchGen allows users to precisely control task difficulty (e.g., graph size, dependency depth, and distractor functions) while avoiding data leakage. We apply our FuncBenchGen framework to evaluate seven LLMs on tool-use tasks of varying difficulty. Reasoning-optimized models consistently outperform general-purpose models, with GPT-5 significantly outperforming other models. Performance declines sharply as dependency depth increases. Furthermore, connected irrelevant functions prove especially difficult to handle. We find that strong models often make syntactically valid function calls but propagate incorrect or stale argument values across steps, revealing brittle state tracking by LLMs in multi-turn tool use. Motivated by this observation, we introduce a simple mitigation strategy that explicitly restates prior variable values to the agent at each step. Surprisingly, this lightweight change yields substantial gains across models, e.g., improving the success rate from 62.5% to 81.3% for GPT-5.
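Abstract (reference translation): As language models gain access to external tools through structured function calls, they become increasingly capable of solving complex multi-step tasks. However, existing benchmarks for tool-augmented language models (TaLMs) offer insufficient control over factors such as the number of accessible functions, task complexity, and input size, and they remain vulnerable to data contamination. We propose FuncBenchGen, a unified, contamination-free framework that evaluates TaLMs by generating synthetic multi-step tool-use tasks. The key idea is to cast tool use as traversal over a hidden function-dependency DAG, in which nodes are function calls and an edge between nodes means that one function consumes the output of another. Given a set of external function schemas, initial variable values, and a target variable, a model must compose the correct call sequence to compute the target variable. FuncBenchGen lets users precisely control task difficulty (e.g., graph size, dependency depth, and distractor functions) while avoiding data leakage. We applied the FuncBenchGen framework to evaluate seven LLMs on tool-use tasks. Reasoning-optimized models consistently outperformed general-purpose models, and GPT-5 significantly outperformed the other models. Performance drops sharply as dependency depth increases. In addition, connected irrelevant functions are especially hard to handle. Strong models often make syntactically valid function calls but propagate incorrect or stale argument values across steps, which reveals brittle state tracking by LLMs in multi-turn tool use. Motivated by this observation, we introduce a simple mitigation strategy that explicitly restates prior variable values to the agent at each step. Surprisingly, this lightweight change yields substantial gains across models; for GPT-5, the success rate improved from 62.5% to 81.3%.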
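The DAG formulation in the abstract can be illustrated with a small sketch. This is not FuncBenchGen's code. The function names (`f_a` to `f_d`), the variable names, the arithmetic bodies, and the wording of the restated-state reminder are all hypothetical, chosen only to show one way a hidden dependency graph, a connected distractor, and the "restate prior variable values" mitigation might look.

```python
from graphlib import TopologicalSorter

# Hypothetical toy task (not from the paper): each function consumes named
# variables and produces exactly one output variable.
FUNCTIONS = {
    "f_a": {"inputs": ["x0"], "output": "v1", "fn": lambda x0: x0 + 1},
    "f_b": {"inputs": ["x1"], "output": "v2", "fn": lambda x1: x1 * 2},
    "f_c": {"inputs": ["v1", "v2"], "output": "target", "fn": lambda v1, v2: v1 * v2},
    # Connected distractor: consumes a real intermediate value (v1),
    # but its output is not needed to compute the target.
    "f_d": {"inputs": ["v1"], "output": "v3", "fn": lambda v1: v1 - 5},
}
INITIAL = {"x0": 2, "x1": 3}


def required_calls(functions, initial, target):
    """Return the functions needed for `target`, in a valid dependency order."""
    producer = {spec["output"]: name for name, spec in functions.items()}
    deps = {}  # function name -> set of functions whose outputs it consumes
    stack = [target]
    while stack:
        var = stack.pop()
        if var in initial:
            continue  # initial values need no function call
        name = producer[var]
        if name in deps:
            continue  # already visited
        inputs = functions[name]["inputs"]
        deps[name] = {producer[v] for v in inputs if v not in initial}
        stack.extend(inputs)
    # TopologicalSorter yields predecessors before the nodes that depend on them.
    return list(TopologicalSorter(deps).static_order())


def run_calls(functions, initial, order):
    """Execute the calls in order, tracking every variable value seen so far."""
    state = dict(initial)
    for name in order:
        spec = functions[name]
        args = [state[v] for v in spec["inputs"]]
        state[spec["output"]] = spec["fn"](*args)
    return state


def restate_state(state):
    """Format known values as a reminder that could be added to each step's prompt."""
    pairs = ", ".join(f"{k}={v}" for k, v in sorted(state.items()))
    return "Known values so far: " + pairs


if __name__ == "__main__":
    order = required_calls(FUNCTIONS, INITIAL, "target")
    state = run_calls(FUNCTIONS, INITIAL, order)
    print(order)
    print(restate_state(state))
```

In this toy setting the reference solution never calls `f_d`, and a model that passes a stale or wrong value of `v1` into `f_c` would still produce a syntactically valid call with an incorrect result, which is the failure mode the abstract describes.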

Related papers