SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
- URL: http://arxiv.org/abs/2602.12670v1
- Date: Fri, 13 Feb 2026 07:06:06 GMT
- Title: SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
- Authors: Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, Xuanqing Liu, Haoran Lyu, Ze Ma, Bowei Wang, Runhui Wang, Tianyu Wang, Wengao Ye, Yue Zhang, Hanwen Xing, Yiqi Xue, Steven Dillmann, Han-chung Lee,
- Abstract summary: Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. We show curated Skills raise the average pass rate by 16.2 percentage points (pp), but effects vary widely by domain. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming.
- Score: 61.89812116484928
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise the average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare), and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2--3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.
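The evaluation protocol described in the abstract is easy to sketch in outline: run each task under the three Skill conditions, score it with its deterministic verifier, and report pass-rate differences in percentage points. The sketch below is a hypothetical harness for illustration only; the `Task`/`Agent` interfaces and method names are assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of a SkillsBench-style protocol: three Skill conditions,
# deterministic pass/fail verification, deltas reported in percentage points (pp).
# All names here are illustrative assumptions, not the benchmark's real interface.
from statistics import mean
from typing import Protocol, Sequence


class Task(Protocol):
    def skills_for(self, condition: str) -> list[str]: ...  # Skills available under a condition
    def verify(self, transcript: str) -> bool: ...           # deterministic pass/fail check


class Agent(Protocol):
    def run(self, task: "Task", skills: list[str]) -> str: ...


CONDITIONS = ("no_skills", "curated_skills", "self_generated_skills")


def pass_rates(tasks: Sequence[Task], agent: Agent, trials: int = 3) -> dict[str, float]:
    """Mean pass rate per condition, averaged over tasks and repeated trials."""
    rates: dict[str, float] = {}
    for condition in CONDITIONS:
        per_task = []
        for task in tasks:
            passes = [
                task.verify(agent.run(task, skills=task.skills_for(condition)))
                for _ in range(trials)
            ]
            per_task.append(mean(passes))  # fraction of trials passing the verifier
        rates[condition] = mean(per_task)
    return rates


def delta_pp(rates: dict[str, float], treatment: str, baseline: str = "no_skills") -> float:
    """Difference in average pass rate between two conditions, in percentage points."""
    return 100.0 * (rates[treatment] - rates[baseline])
```

Under this framing, the headline result corresponds to delta_pp(rates, "curated_skills") averaging roughly +16.2 across the seven agent-model configurations, while delta_pp(rates, "self_generated_skills") comes out near zero.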
Related papers
- EvoSkill: Automated Skill Discovery for Multi-Agent Systems [6.319876096746374]
We introduce EvoSkill, a self-evolving framework that automatically discovers and refines agent skills.
EvoSkill analyzes execution failures, proposes new skills or edits to existing ones, and materializes them into structured, reusable skill folders.
We evaluate EvoSkill on two benchmarks: OfficeQA, a grounded reasoning benchmark over U.S. Treasury data, and SealQA, a noisy retrieval benchmark.
arXiv Detail & Related papers (2026-03-03T09:07:22Z)
- SkillCraft: Can LLM Agents Learn to Use Tools Skillfully? [67.69996753743129]
We introduce SkillCraft, a benchmark that explicitly stress-tests an agent's ability to form and reuse higher-level tool compositions.
SkillCraft features realistic, highly compositional tool-use scenarios with difficulty scaled along both quantitative and structural dimensions.
We propose a lightweight evaluation protocol that enables agents to auto-compose atomic tools into executable Skills and to cache and reuse them within and across tasks.
arXiv Detail & Related papers (2026-02-28T15:44:31Z)
- SkillNet: Create, Evaluate, and Connect AI Skills [159.47504178122156]
SkillNet is an open infrastructure designed to create, evaluate, and organize AI skills at scale.
Our infrastructure integrates a repository of over 200,000 skills, an interactive platform, and a versatile Python toolkit.
arXiv Detail & Related papers (2026-02-26T14:24:02Z)
- SoK: Agentic Skills -- Beyond Tool Use in LLM Agents [6.356997609995175]
Agentic systems increasingly rely on reusable procedural capabilities, a.k.a. agentic skills, to execute long-horizon tasks reliably.
This paper maps the skill layer across the full lifecycle (discovery, practice, distillation, storage, composition, evaluation, and update).
We analyze the security and governance implications of skill-based agents, covering supply-chain risks, prompt injection via skill payloads, and trust-tiered execution.
arXiv Detail & Related papers (2026-02-24T13:11:38Z)
- PolySkill: Learning Generalizable Skills Through Polymorphic Abstraction [20.687269802717893]
We introduce PolySkill, a new framework that enables agents to learn generalizable and compositional skills.
Experiments show that our method improves skill reuse by 1.7x on seen websites.
By enabling the agent to identify and refine its own goals, PolySkill enhances the agent's ability to learn a better curriculum.
arXiv Detail & Related papers (2025-10-17T17:56:00Z)
- SkillWeaver: Web Agents can Self-Improve by Discovering and Honing Skills [48.05057798832005]
We introduce SkillWeaver, a skill-centric framework enabling web agents to self-improve by autonomously synthesizing reusable skills as APIs.
Given a new website, the agent autonomously discovers skills, executes them for practice, and distills practice experiences into robust APIs.
Experiments on WebArena and real-world websites demonstrate the efficacy of SkillWeaver, achieving relative success rate improvements of 31.8% and 39.8%, respectively.
arXiv Detail & Related papers (2025-04-09T17:51:50Z)
- Inducing Programmatic Skills for Agentic Tasks [69.29902147942673]
We propose agent skill induction (ASI) to allow agents to adapt themselves by inducing, verifying, and utilizing program-based skills on the fly.
We show that ASI outperforms the static baseline agent and its text-skill counterpart by 23.5% and 11.3% in success rate.
arXiv Detail & Related papers (2025-04-09T12:25:37Z)
- Unearthing Skill-Level Insights for Understanding Trade-Offs of Foundation Models [61.467781476005435]
Skill-wise performance is obscured when inspecting aggregate accuracy, under-utilizing the rich signal modern benchmarks contain.
We propose an automatic approach to recover the underlying skills relevant for any evaluation instance, by way of inspecting model-generated rationales.
Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.
arXiv Detail & Related papers (2024-10-17T17:51:40Z)
- SkillMatch: Evaluating Self-supervised Learning of Skill Relatedness [11.083396379885478]
We release SkillMatch, a benchmark for the task of skill relatedness, based on expert knowledge mined from millions of job ads.
We also propose a scalable self-supervised learning technique to adapt a Sentence-BERT model based on skill co-occurrence in job ads.
arXiv Detail & Related papers (2024-10-07T13:05:26Z)
- SkillMimic: Learning Basketball Interaction Skills from Demonstrations [85.23012579911378]
We introduce SkillMimic, a unified data-driven framework that fundamentally changes how agents learn interaction skills.
Our key insight is that a unified HOI imitation reward can effectively capture the essence of diverse interaction patterns from HOI datasets.
For evaluation, we collect and introduce two basketball datasets containing approximately 35 minutes of diverse basketball skills.
arXiv Detail & Related papers (2024-08-12T15:19:04Z)
- Agentic Skill Discovery [19.5703917813767]
Language-conditioned robotic skills make it possible to apply the high-level reasoning of Large Language Models (LLMs) to low-level robotic control.
A remaining challenge is to acquire a diverse set of fundamental skills.
We introduce a novel framework for skill discovery that is entirely driven by LLMs.
arXiv Detail & Related papers (2024-05-23T19:44:03Z)
- Residual Skill Policies: Learning an Adaptable Skill-based Action Space for Reinforcement Learning for Robotics [18.546688182454236]
Skill-based reinforcement learning (RL) has emerged as a promising strategy to leverage prior knowledge for accelerated robot learning.
We propose accelerating exploration in the skill space using state-conditioned generative models.
We validate our approach across four challenging manipulation tasks, demonstrating our ability to learn across task variations.
arXiv Detail & Related papers (2022-11-04T02:42:17Z)