Fugu-MT Paper Translation (Summary): A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Paper summary: A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

arxiv url: http://arxiv.org/abs/2605.28556v2
Date: Tue, 02 Jun 2026 10:20:59 GMT
Status: Translation complete
System update date: 2026-06-03 18:57:50.155154
Title: A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks
Title (reference translation): A Matter of TASTE: Improving the Coverage and Difficulty of Agent Benchmarks
Authors: Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz, Michal Shmueli-Scheuer, Roi Reichart
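Abstract summary: We propose TASTE: Task Synthesis from Tool Sequence Evolution. TASTE selects representative sequences from the pool via clustering, instantiates them into complete benchmark tasks, and refines them through difficulty evolution. The results suggest that high scores on existing benchmarks often reflect saturation rather than robust task-solving ability.
Reference score (independently computed attention score): 25.713629634281077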
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As agent capabilities advance, existing benchmarks, such as $τ^2$-Bench, are becoming increasingly saturated. Yet constructing new benchmark tasks remains complex, costly, and labor-intensive. Moreover, the standard approach, in which scenarios are first written in natural language and then mapped to tool sequences, captures only a narrow subset of the tool-use patterns agents exercise. In this paper, we address these problems by reversing the task construction process. We propose TASTE: Task Synthesis from Tool Sequence Evolution, an automatic method that generates challenging tasks with broader tool-use coverage. TASTE utilizes an Adaptive Contrastive $n$-gram model trained on LLM-judged validity signals. This enables sampling valid tool sequences that cover a vast range of tool combinations. TASTE then selects representative sequences from the pool via clustering, instantiates them into complete benchmark tasks, and refines them through iterative difficulty evolution. Using TASTE, we construct $τ^c$-Bench, a challenging extension of the three domains of $τ^2$-Bench. We evaluate $11$ agent/user LLM pairs and find that models nearly saturating $τ^2$-Bench suffer severe performance drops on our tasks (e.g., Gemini-3-Flash falls from $0.82\!-\!0.94$ to $0.28\!-\!0.61$). Beyond increasing difficulty, our generated tasks more than double the number of unique tool combinations agents must execute. Our results suggest high scores on existing benchmarks often reflect saturation rather than robust task-solving ability. By automating the generation of difficult, high-coverage benchmarks, TASTE enables continuous, scalable evaluation of future agents.
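Abstract (reference translation): As agent capabilities advance, existing benchmarks such as $τ^2$-Bench are becoming saturated. Building new benchmark tasks, however, remains complex, costly, and labor-intensive. In addition, the standard approach of first writing scenarios in natural language and then mapping them to tool sequences captures only a narrow subset of the tool-use patterns that agents exercise. This paper addresses these problems by reversing the task construction process. We propose TASTE: Task Synthesis from Tool Sequence Evolution, an automatic method that generates challenging tasks with broader tool-use coverage. TASTE uses an Adaptive Contrastive $n$-gram model trained on LLM-judged validity signals, which makes it possible to sample valid tool sequences covering a wide range of tool combinations. TASTE then selects representative sequences from the pool via clustering, instantiates them into complete benchmark tasks, and refines them through iterative difficulty evolution. Using TASTE, we construct $τ^c$-Bench, a challenging extension of the three domains of $τ^2$-Bench. We evaluate $11$ agent/user LLM pairs and find that models close to saturating $τ^2$-Bench drop sharply on our tasks (e.g., Gemini-3-Flash falls from $0.82\!-\!0.94$ to $0.28\!-\!0.61$). Beyond the added difficulty, the generated tasks more than double the number of unique tool combinations that agents must execute. These results suggest that high scores on existing benchmarks often reflect saturation rather than robust task-solving ability. By automating the generation of difficult, high-coverage benchmarks, TASTE enables continuous, scalable evaluation of future agents.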
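
The abstract describes sampling tool sequences from an $n$-gram model and reports coverage as the number of unique tool combinations. The Python sketch below illustrates only those two ideas, and it rests on several assumptions. It uses a plain bigram model fit on sequences already labeled valid, not the paper's Adaptive Contrastive $n$-gram model, whose details the abstract does not give. It treats a "tool combination" as the unordered set of tool names in a sequence, which may differ from the paper's definition. The tool names and all function names are hypothetical.

```python
import random
from collections import defaultdict

# Sentinel tokens marking sequence boundaries (illustrative choice).
START, END = "<s>", "</s>"


def fit_bigram_counts(sequences):
    """Count tool-to-tool transitions over sequences assumed to be valid."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        tokens = [START, *seq, END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def sample_sequence(counts, rng, max_len=8):
    """Sample one tool sequence by walking the bigram transition counts."""
    seq, prev = [], START
    while len(seq) < max_len:
        choices = counts[prev]
        if not choices:  # no observed continuation: stop early
            break
        tools = list(choices)
        weights = [choices[t] for t in tools]
        nxt = rng.choices(tools, weights=weights, k=1)[0]
        if nxt == END:
            break
        seq.append(nxt)
        prev = nxt
    return seq


def unique_tool_combinations(sequences):
    """Assumed coverage measure: distinct unordered sets of tool names."""
    return {frozenset(seq) for seq in sequences}


if __name__ == "__main__":
    # Hypothetical tool names, not taken from the paper or from tau^2-Bench.
    valid = [
        ["get_user", "search_flights", "book_flight"],
        ["get_user", "get_booking", "cancel_booking"],
        ["get_user", "get_booking", "search_flights", "book_flight"],
    ]
    counts = fit_bigram_counts(valid)
    rng = random.Random(0)
    pool = [sample_sequence(counts, rng) for _ in range(200)]
    print("seed combinations:   ", len(unique_tool_combinations(valid)))
    print("sampled combinations:", len(unique_tool_combinations(pool)))
```

Because the sampler can recombine transitions seen in different seed sequences, the sampled pool can contain tool combinations that no seed sequence has. That is the intuition behind broader coverage, although the clustering, task instantiation, and difficulty evolution steps named in the abstract are not shown here.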
