Fugu-MT Paper Translation (Abstract): OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

Paper overview: OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents

arxiv url: http://arxiv.org/abs/2605.23657v1
Date: Fri, 22 May 2026 14:09:41 GMT
Status: Translation complete
Last updated in system: 2026-05-25 17:29:20.384469
Title: OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents
Title (reference translation): OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents
Authors: Jiahao Ying, Boxian Ai, Wei Tang, Siyuan Liu, Yixin Cao,
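Abstract summary: We propose OpenSkillEval, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts. Using more than 600 dynamically generated task instances and 30 open-source skills, it systematically evaluates state-of-the-art models and agent frameworks.
Reference score (independently computed attention score): 15.598856888948093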
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. However, as the open-source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how to evaluate skill quality, and how users should select skills under practical cost-performance trade-offs. In this paper, we present OpenSkillEval, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It further collects and organizes community-contributed skills for controlled comparison under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, we conduct a systematic evaluation of state-of-the-art models and agent frameworks. Our results show that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills. These findings highlight the need for dynamic, task-grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website: https://yingjiahao14.github.io/OpenSkillEval-Web/.
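Abstract (reference translation): Skills, that is, structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. However, as the open-source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how skill quality should be evaluated, and how users should choose skills under practical cost-performance trade-offs. This paper presents OpenSkillEval, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It also collects and organizes community-contributed skills so that they can be compared in a controlled way under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, the authors systematically evaluate state-of-the-art models and agent frameworks. The results indicate that making a skill available does not guarantee that it is used effectively, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills. These findings underline the need for dynamic, task-grounded evaluation and offer practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website (https://yingjiahao14.github.io/OpenSkillEval-Web/).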


Related papers