Fugu-MT Paper Translation (Abstract): SkillResolve-Bench: Measuring and Resolving Same-Capability Ambiguity in Agent Skill Retrieval

Paper overview: SkillResolve-Bench: Measuring and Resolving Same-Capability Ambiguity in Agent Skill Retrieval

arxiv url: http://arxiv.org/abs/2606.10388v1
Date: Tue, 09 Jun 2026 03:54:45 GMT
Status: Translation complete
Last updated in system: 2026-06-10 15:40:58.309643
Title: SkillResolve-Bench: Measuring and Resolving Same-Capability Ambiguity in Agent Skill Retrieval
Title (reference translation): SkillResolve-Bench: Measuring and Resolving Same-Capability Ambiguity in Agent Skill Retrieval
Authors: Jiandong Ding
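Abstract summary: Agent skill libraries are becoming routable software assets. A retriever can find the right capability family yet expose the wrong same-capability representative. We study this failure as same-capability execution-risk retrieval.
Reference score (independently computed attention score): 3.7179887342776445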
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Agent skill libraries are becoming routable software assets: a retrieved skill can contribute instructions, scripts, resource bindings, and execution assumptions to an agent. This makes skill retrieval more than broad relevance matching. A retriever can find the right capability family yet expose the wrong same-capability representative. We study this failure as same-capability execution-risk retrieval. Each query pairs a helpful skill with a query-specific risky sibling that shares the capability family but can lead execution toward a stale resource, missing precondition, or wrong procedure. We introduce SkillResolve-Bench 1.0, an auditable benchmark for this setting with 661 helpful/risky pairs, source-role and admission evidence, cue/leakage checks, query-disjoint splits, and a 7,982-candidate pool that includes 6,660 public SkillRet candidates. The benchmark reports helpful ranking together with harmful sibling rate (HSR@K), the top-K exposure of the risky sibling. We also provide SkillResolve, a reference method that resolves active candidate families, scores query-conditioned utility from confusable library negatives and contract-profile cues, and selects one representative from each family before the final top-K list. Under the released family relation, SkillResolve reaches Recall@3 0.766 and NDCG@3 0.699 while keeping HSR@3=0. It improves over SkillRouter by 0.112 Recall@3 and 0.165 NDCG@3 while reducing HSR@3 from 0.693 to 0. Without representative selection, HSR@3 rises to 0.236 under the same scorer, identifying within-family representative choice as the mechanism that turns capability retrieval into safer procedural exposure.
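Abstract (reference translation): Agent skill libraries are turning into routable software assets, because a retrieved skill can supply an agent with instructions, scripts, resource bindings, and execution assumptions. Skill retrieval is therefore more than broad relevance matching. A retriever can find the right capability family and still expose the wrong representative of that same capability. We study this failure as same-capability execution-risk retrieval. Each query pairs a helpful skill with a query-specific risky sibling that shares the capability family but can steer execution toward a stale resource, a missing precondition, or a wrong procedure. SkillResolve-Bench 1.0 is an auditable benchmark for this setting, with 661 helpful/risky pairs, source-role and admission evidence, cue/leakage checks, query-disjoint splits, and a 7,982-candidate pool that includes 6,660 public SkillRet candidates. The benchmark reports helpful ranking together with the harmful sibling rate (HSR@K), the top-K exposure of the risky sibling. SkillResolve is a reference method that resolves active candidate families, scores query-conditioned utility from confusable library negatives and contract-profile cues, and selects one representative from each family before the final top-K list. Under the released family relation, SkillResolve reaches Recall@3 0.766 and NDCG@3 0.699 while keeping HSR@3=0. It improves over SkillRouter by 0.112 Recall@3 and 0.165 NDCG@3 while reducing HSR@3 from 0.693 to 0. Without representative selection, HSR@3 rises to 0.236 under the same scorer, which identifies within-family representative choice as the mechanism that turns capability retrieval into safer procedural exposure.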
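
The abstract describes HSR@K only as the top-K exposure of the risky sibling, and describes representative selection only as keeping one candidate per family before the final top-K list. The following is a minimal sketch of how those two ideas might fit together, not the paper's implementation. It assumes that HSR@K is the fraction of queries whose risky sibling appears in the top K, and that a score per candidate is already available. The skill names, scores, family labels, and function names are all hypothetical, and the paper's scorer (confusable library negatives, contract-profile cues) is not modeled.

```python
from typing import Dict, List, Sequence


def recall_at_k(rankings: Sequence[Sequence[str]], helpful: Sequence[str], k: int) -> float:
    """Fraction of queries whose helpful skill appears in the top-k list."""
    return sum(h in ranked[:k] for ranked, h in zip(rankings, helpful)) / len(rankings)


def hsr_at_k(rankings: Sequence[Sequence[str]], risky: Sequence[str], k: int) -> float:
    """Assumed HSR@K: fraction of queries whose risky sibling appears in the top-k list."""
    return sum(r in ranked[:k] for ranked, r in zip(rankings, risky)) / len(rankings)


def select_representatives(scores: Dict[str, float], family_of: Dict[str, str], k: int) -> List[str]:
    """Keep the best-scoring candidate of each family, then return the top-k of those."""
    best: Dict[str, str] = {}
    for skill, score in scores.items():
        family = family_of.get(skill, skill)  # a skill with no family is its own family
        if family not in best or score > scores[best[family]]:
            best[family] = skill
    return sorted(best.values(), key=lambda s: scores[s], reverse=True)[:k]


# Hypothetical scores for one query; "deploy_v1_stale" is the risky sibling of "deploy_v2".
scores = {"deploy_v2": 0.91, "deploy_v1_stale": 0.88, "rollback": 0.40, "lint": 0.35}
family_of = {
    "deploy_v2": "deploy",
    "deploy_v1_stale": "deploy",
    "rollback": "rollback",
    "lint": "lint",
}

flat_top3 = sorted(scores, key=lambda s: scores[s], reverse=True)[:3]
resolved_top3 = select_representatives(scores, family_of, 3)

print(flat_top3, hsr_at_k([flat_top3], ["deploy_v1_stale"], 3))
print(resolved_top3, hsr_at_k([resolved_top3], ["deploy_v1_stale"], 3))
print(recall_at_k([resolved_top3], ["deploy_v2"], 3))
```

In this toy case the flat ranking exposes both members of the deploy family, while the per-family selection keeps only the higher-scoring one. The selection step can only keep the helpful skill if the scorer ranks it above its risky sibling within the family, so the quality of the within-family score still matters.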
