EvoSkill: Automated Skill Discovery for Multi-Agent Systems
- URL: http://arxiv.org/abs/2603.02766v1
- Date: Tue, 03 Mar 2026 09:07:22 GMT
- Title: EvoSkill: Automated Skill Discovery for Multi-Agent Systems
- Authors: Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, Tu Vu,
- Abstract summary: We introduce EvoSkill, a self-evolving framework that automatically discovers and refines agent skills. EvoSkill analyzes execution failures, proposes new skills or edits to existing ones, and materializes them into structured, reusable skill folders. We evaluate EvoSkill on two benchmarks: OfficeQA, a grounded reasoning benchmark over U.S. Treasury data, and SealQA, a noisy retrieval benchmark.
- Score: 6.319876096746374
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Coding agents are increasingly used as general-purpose problem solvers, but their flexibility does not by itself confer the domain expertise needed for specialized tasks. Recent work addresses this through \textit{agent skills}: reusable workflows and code that augment agents with domain-specific capabilities. Most skills today are hand-crafted, and existing evolutionary approaches optimize low-level artifacts (e.g., prompts \& code) that are tightly coupled to specific models and tasks. We introduce \textbf{EvoSkill}, a self-evolving framework that automatically discovers and refines agent skills through iterative failure analysis. EvoSkill analyzes execution failures, proposes new skills or edits to existing ones, and materializes them into structured, reusable skill folders. A Pareto frontier of agent programs governs selection, retaining only skills that improve held-out validation performance while the underlying model remains frozen. We evaluate EvoSkill on two benchmarks: OfficeQA, a grounded reasoning benchmark over U.S.\ Treasury data, where it improves exact-match accuracy by \textbf{7.3\%} (60.6\% $\to$ 67.9\%); and SealQA, a search-augmented QA benchmark with noisy retrieval, where it yields a \textbf{12.1\%} gain (26.6\% $\to$ 38.7\%). We also investigate the zero-shot transfer capabilities of skills evolved on one task to the other; in particular, skills evolved on SealQA transfer zero-shot to BrowseComp, improving accuracy by \textbf{5.3\%} without modification, demonstrating that skill-level optimization produces transferable capabilities beyond the training task.
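The loop the abstract describes (analyze failures, propose or edit skills, retain only changes that improve held-out validation while the model stays frozen) can be sketched as follows. This is a toy illustration, not the paper's implementation: the tag-matching executor, `propose_skill`, and the task/skill dictionaries are invented stand-ins, and the paper's Pareto-frontier selection is reduced here to a single greedy accept-if-better rule.

```python
def run_agent(program, task):
    # Toy executor: a task succeeds if the program holds a skill matching its tag.
    ok = task["tag"] in {s["tag"] for s in program["skills"]}
    return ok, (None if ok else task["tag"])

def propose_skill(failure_tag):
    # Stand-in for the LLM step that turns a failure trace into a skill folder.
    return {"tag": failure_tag, "body": f"handle tasks tagged '{failure_tag}'"}

def validate(program, val_tasks):
    # Held-out validation score: fraction of validation tasks solved.
    return sum(run_agent(program, t)[0] for t in val_tasks) / len(val_tasks)

def evolve(train_tasks, val_tasks, iters=5):
    best = {"skills": []}  # the underlying model stays frozen; only skills evolve
    for _ in range(iters):
        failures = [run_agent(best, t)[1] for t in train_tasks]
        failures = [f for f in failures if f is not None]
        if not failures:
            break
        candidate = {"skills": best["skills"] + [propose_skill(failures[0])]}
        # Selection: retain the candidate only if validation performance improves.
        if validate(candidate, val_tasks) > validate(best, val_tasks):
            best = candidate
    return best
```

With two training tasks tagged `date-math` and `csv-parse`, the loop acquires one skill per failure mode and stops once validation reaches 1.0.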
Related papers
- Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale [28.43462779191672]
AgentSkillOS is a principled framework for skill selection, orchestration, and ecosystem-level management. AgentSkillOS comprises two stages: (i) Manage Skills, which organizes skills into a capability tree; and (ii) Solve Tasks, which retrieves, orchestrates, and executes multiple skills through DAG-based pipelines.
arXiv Detail & Related papers (2026-03-02T18:46:47Z) - SkillCraft: Can LLM Agents Learn to Use Tools Skillfully? [67.69996753743129]
We introduce SkillCraft, a benchmark that explicitly stress-tests agents' ability to form and reuse higher-level tool compositions. SkillCraft features realistic, highly compositional tool-use scenarios with difficulty scaled along both quantitative and structural dimensions. We propose a lightweight evaluation protocol that enables agents to auto-compose atomic tools into executable skills, then cache and reuse them within and across tasks.
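The protocol summarized above, in which agents compose atomic tools into executable skills and cache them for reuse across tasks, can be illustrated with a minimal sketch; `SkillCache`, its method names, and the example tools are hypothetical, not SkillCraft's actual API.

```python
from typing import Callable, Dict, List

class SkillCache:
    """Toy registry: a 'skill' is a named, cached pipeline of atomic tools."""

    def __init__(self, tools: Dict[str, Callable]):
        self.tools = tools            # atomic tools available to the agent
        self.skills: Dict[str, List[str]] = {}  # cached compositions

    def compose(self, name: str, steps: List[str]) -> None:
        # Cache a new skill as an ordered composition of atomic tool names.
        self.skills[name] = steps

    def run(self, name: str, value):
        # Reuse the cached composition instead of re-deriving it per task.
        for step in self.skills[name]:
            value = self.tools[step](value)
        return value
```

For example, composing `double` and `inc` once lets every later task invoke the pipeline by name instead of re-planning the tool sequence.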
arXiv Detail & Related papers (2026-02-28T15:44:31Z) - SkillNet: Create, Evaluate, and Connect AI Skills [159.47504178122156]
SkillNet is an open infrastructure designed to create, evaluate, and organize AI skills at scale. Our infrastructure integrates a repository of over 200,000 skills, an interactive platform, and a versatile Python toolkit.
arXiv Detail & Related papers (2026-02-26T14:24:02Z) - SoK: Agentic Skills -- Beyond Tool Use in LLM Agents [6.356997609995175]
Agentic systems increasingly rely on reusable procedural capabilities, a.k.a. agentic skills, to execute long-horizon tasks reliably. This paper maps the skill layer across the full lifecycle (discovery, practice, distillation, storage, composition, evaluation, and update). We analyze the security and governance implications of skill-based agents, covering supply-chain risks, prompt injection via skill payloads, and trust-tiered execution.
arXiv Detail & Related papers (2026-02-24T13:11:38Z) - SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning [83.98129545309277]
We propose SkillRL, a framework that bridges the gap between raw experience and policy improvement. Our approach introduces an experience-based distillation mechanism to build a hierarchical skill library, SkillBank. Experimental results on ALF, WebShop, and seven search-augmented tasks demonstrate that SkillRL achieves state-of-the-art performance.
arXiv Detail & Related papers (2026-02-09T03:17:17Z) - Offline Discovery of Interpretable Skills from Multi-Task Trajectories [8.119611773942562]
We introduce LOKI, a three-stage end-to-end learning framework for offline skill discovery and hierarchical imitation. LOKI achieves high success rates on the challenging D4RL Kitchen benchmark and outperforms standard HIL baselines.
arXiv Detail & Related papers (2026-02-01T05:03:58Z) - Agentic Confidence Calibration [67.50096917021521]
Holistic Trajectory Calibration (HTC) is a novel diagnostic framework for AI agents. HTC consistently surpasses strong baselines in both calibration and discrimination, and provides interpretability by revealing the signals behind failure.
arXiv Detail & Related papers (2026-01-22T09:08:25Z) - Do We Always Need Query-Level Workflows? Rethinking Agentic Workflow Generation for Multi-Agent Systems [72.3575737073235]
Multi-Agent Systems (MAS) solve complex tasks by coordinating multiple agents. Existing approaches generate workflows either at the task level or the query level, but their relative costs and benefits remain unclear. We show that query-level workflow generation is not always necessary, since a small set of top-K task-level workflows already covers equivalent or even more queries.
arXiv Detail & Related papers (2026-01-16T10:05:51Z) - AgentMath: Empowering Mathematical Reasoning for Large Language Models via Tool-Augmented Agent [80.83250816918861]
Large Reasoning Models (LRMs) like o3 and DeepSeek-R1 have achieved remarkable progress in natural language reasoning with long chain-of-thought. However, they remain computationally inefficient and struggle with accuracy when solving problems requiring complex mathematical operations. We present AgentMath, an agent framework that seamlessly integrates language models' reasoning capabilities with code interpreters' computational precision.
arXiv Detail & Related papers (2025-12-23T19:57:49Z) - Reinforcement Learning for Self-Improving Agent with Skill Library [14.717149089634718]
Large Language Model (LLM)-based agents have demonstrated remarkable capabilities in complex reasoning and multi-turn interactions. One promising approach is implementing skill libraries that allow agents to learn, validate, and apply new skills. We propose a Reinforcement Learning (RL)-based approach to enhance agents' self-improvement capabilities with a skill library.
arXiv Detail & Related papers (2025-12-18T21:58:19Z) - SCOPE: Prompt Evolution for Enhancing Agent Effectiveness [53.75986399936395]
Large Language Model (LLM) agents are increasingly deployed in environments that generate massive, dynamic contexts. While agents have access to this context, their static prompts lack the mechanisms to manage it effectively. We introduce SCOPE (Self-evolving Context Optimization via Prompt Evolution) and propose a Dual-Stream mechanism that balances tactical specificity (resolving immediate errors) with strategic generality (evolving long-term principles).
arXiv Detail & Related papers (2025-12-17T12:25:05Z) - Evolving Excellence: Automated Optimization of LLM-based Agents [33.81822162934331]
We present ARTEMIS, a no-code evolutionary optimization platform that jointly optimizes agent configurations through semantically-aware genetic operators. We evaluate ARTEMIS on four representative agent systems, including the ALE Agent for competitive programming on AtCoder Heuristic Contest, achieving a 13.6% improvement in acceptance rate. We also evaluate the MathTales-Teacher Agent, powered by a smaller open-source model (Qwen2.5-7B), on GSM8K primary-level mathematics problems.
arXiv Detail & Related papers (2025-12-09T20:48:45Z) - Reinforcement Learning Integrated Agentic RAG for Software Test Cases Authoring [0.0]
This paper introduces a framework that integrates reinforcement learning (RL) with autonomous agents to enable continuous improvement in the automated authoring of software test cases from business requirement documents within Quality Engineering (QE). Our proposed Reinforcement Infused Agentic RAG (Retrieve, Augment, Generate) framework employs AI agents that learn from QE feedback, assessments, and defect-discovery outcomes to automatically improve their test-case generation strategies.
arXiv Detail & Related papers (2025-12-05T17:52:26Z) - Structured Uncertainty guided Clarification for LLM Agents [126.26213027785813]
LLM agents extend large language models with tool-calling capabilities, but ambiguous user instructions often lead to incorrect invocations and task failures. We introduce a principled formulation of structured uncertainty over tool-call parameters, modeling joint tool-argument clarification as a POMDP with an Expected Value of Perfect Information (EVPI) objective for optimal question selection, and aspect-based cost modeling to prevent redundancy. Our SAGE-Agent leverages this structured uncertainty to achieve superior efficiency: increasing coverage on ambiguous tasks by 7-39% while reducing clarification questions by 1.5-2.7x compared to strong prompting and uncertainty-based baselines.
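The EVPI objective mentioned above has a standard discrete form: EVPI = E_s[max_a U(a, s)] - max_a E_s[U(a, s)], i.e., the value of acting after observing the true state minus the value of the best single action under the current belief. A small sketch (the belief and utility tables are invented for illustration, not from the paper):

```python
def evpi(posterior, utilities):
    """Expected Value of Perfect Information over a discrete belief.

    posterior: {state: probability}
    utilities: {action: {state: utility}}
    """
    # Value if the agent could observe the true state before acting.
    value_informed = sum(
        p * max(u[s] for u in utilities.values())
        for s, p in posterior.items()
    )
    # Value of the best single action under the current uncertainty.
    value_uninformed = max(
        sum(posterior[s] * u[s] for s in posterior)
        for u in utilities.values()
    )
    return value_informed - value_uninformed
```

A clarification question is worth asking only when its EVPI exceeds its cost; with a uniform belief over two states and one correct tool call per state, EVPI is 0.5.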
arXiv Detail & Related papers (2025-11-11T21:50:44Z) - Alita-G: Self-Evolving Generative Agent for Agent Generation [54.49365835457433]
We present ALITA-G, a framework that transforms a general-purpose agent into a domain expert. In this framework, a generalist agent executes a curated suite of target-domain tasks. It attains strong gains while reducing computation costs.
arXiv Detail & Related papers (2025-10-27T17:59:14Z) - Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z) - Tournament of Prompts: Evolving LLM Instructions Through Structured Debates and Elo Ratings [0.9437165725355702]
We introduce DEEVO, a novel framework that guides prompt evolution through a debate-driven evaluation with an Elo-based selection. Using Elo ratings as a fitness proxy, DEEVO simultaneously drives improvement and preserves valuable diversity in the prompt population.
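The Elo fitness proxy mentioned above rests on the standard rating update: a prompt's expected score is a logistic function of the rating gap, and ratings move by a K-factor times the surprise. A minimal sketch (the K-factor and initial ratings below are conventional Elo defaults, not values taken from the paper):

```python
def elo_update(r_winner, r_loser, k=32):
    """Standard Elo update after one pairwise comparison (e.g., a debate).

    Returns the new (winner, loser) ratings.
    """
    # Expected score of the winner, from the rating difference.
    expected_w = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_w)  # surprise scaled by the K-factor
    return r_winner + delta, r_loser - delta
```

Two equally-rated prompts (1000 vs. 1000) have expected score 0.5, so a win moves the ratings to 1016 and 984; repeated debates thus rank the population without needing an absolute quality metric.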
arXiv Detail & Related papers (2025-05-30T19:33:41Z) - Design of Negative Sampling Strategies for Distantly Supervised Skill Extraction [19.43668931500507]
We propose an end-to-end system for skill extraction, based on distant supervision through literal matching.
We observe that using the ESCO taxonomy to select negative examples from related skills yields the biggest improvements.
We release the benchmark dataset for research purposes to stimulate further research on the task.
arXiv Detail & Related papers (2022-09-13T13:37:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.