Related papers: EvoTool: Self-Evolving Tool-Use Policy Optimization in LLM Agents via Blame-Aware Mutation and Diversity-Aware Selection

URL: http://arxiv.org/abs/2603.04900v1
Date: Thu, 05 Mar 2026 07:42:53 GMT
Title: EvoTool: Self-Evolving Tool-Use Policy Optimization in LLM Agents via Blame-Aware Mutation and Diversity-Aware Selection
Authors: Shuo Yang, Soyeon Caren Han, Xueqi Ma, Yan Li, Mohammad Reza Ghasemi Madani, Eduard Hovy
Abstract summary: EvoTool decomposes an agent's tool-use policy into four modules: Planner, Selector, Caller, and Synthesizer. It iteratively improves them in a self-improving loop through three novel mechanisms. It outperforms strong baselines by over 5 points on GPT-4.1 and Qwen3-8B, while achieving superior efficiency and transferability.
Score: 20.648927252425356
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: LLM-based agents depend on effective tool-use policies to solve complex tasks, yet optimizing these policies remains challenging due to delayed supervision and the difficulty of credit assignment in long-horizon trajectories. Existing optimization approaches tend to be either monolithic, and thus prone to entangling behaviors, or single-aspect, ignoring cross-module error propagation. To address these limitations, we propose EvoTool, a self-evolving framework that optimizes a modular tool-use policy via a gradient-free evolutionary paradigm. EvoTool decomposes an agent's tool-use policy into four modules (Planner, Selector, Caller, and Synthesizer) and iteratively improves them in a self-improving loop through three novel mechanisms. Trajectory-Grounded Blame Attribution uses diagnostic traces to localize failures to a specific module. Feedback-Guided Targeted Mutation then edits only that module via natural-language critique. Diversity-Aware Population Selection preserves complementary candidates to ensure solution diversity. Across four benchmarks, EvoTool outperforms strong baselines by over 5 points on both GPT-4.1 and Qwen3-8B, while achieving superior efficiency and transferability. The code will be released once the paper is accepted.
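The loop described in the abstract (blame a module, mutate only that module, keep a diverse population) can be illustrated with a toy sketch. This is not the authors' implementation, which has not been released. Everything here is an assumption made for illustration: each module is an integer "skill level" standing in for a prompt, a task is a dictionary of required skill levels, blame is a simple failure count over traces, mutation is an increment rather than a natural-language critique, and selection is a greedy task-coverage heuristic. All function and variable names are hypothetical.

```python
from dataclasses import dataclass

MODULES = ("planner", "selector", "caller", "synthesizer")


@dataclass(frozen=True)
class Policy:
    # Toy stand-in: one integer skill per module (the paper's modules are LLM components).
    skills: tuple


def run_task(policy, task):
    """Return (success, trace); the trace records which modules met the task's needs."""
    trace = {m: s >= task[m] for m, s in zip(MODULES, policy.skills)}
    return all(trace.values()), trace


def evaluate(policy, tasks):
    """Return (solved task indices, traces of failed tasks, number of passed module checks)."""
    solved, failures, passed = set(), [], 0
    for i, task in enumerate(tasks):
        ok, trace = run_task(policy, task)
        passed += sum(trace.values())
        if ok:
            solved.add(i)
        else:
            failures.append(trace)
    return solved, failures, passed


def attribute_blame(failures):
    """Assumed blame rule: the module that failed in the most traces (ties: earliest module)."""
    counts = {m: sum(not t[m] for t in failures) for m in MODULES}
    return max(MODULES, key=counts.get)


def mutate(policy, module):
    """Targeted mutation: change only the blamed module, leave the others untouched."""
    i = MODULES.index(module)
    skills = list(policy.skills)
    skills[i] += 1
    return Policy(tuple(skills))


def select_diverse(candidates, tasks, k):
    """Greedy selection that prefers candidates solving tasks not yet covered."""
    scored = [(p, *evaluate(p, tasks)[::2]) for p in candidates]  # (policy, solved, passed)
    chosen, covered = [], set()
    while scored and len(chosen) < k:
        best = max(scored, key=lambda c: (len(c[1] - covered), c[2]))
        chosen.append(best[0])
        covered |= best[1]
        scored.remove(best)
    return chosen


def evolve(tasks, generations=10, k=3):
    population = [Policy((0, 0, 0, 0))]
    for _ in range(generations):
        children = []
        for parent in population:
            _, failures, _ = evaluate(parent, tasks)
            if failures:
                children.append(mutate(parent, attribute_blame(failures)))
        pool = list(dict.fromkeys(population + children))  # drop duplicates, keep order
        population = select_diverse(pool, tasks, k)
    return population


TASKS = [
    {"planner": 1, "selector": 0, "caller": 2, "synthesizer": 0},
    {"planner": 0, "selector": 1, "caller": 1, "synthesizer": 1},
]
best = evolve(TASKS)[0]
print(best, evaluate(best, TASKS)[0])
```

Because mutation touches only the blamed module, the final candidate raises each skill no further than the tasks require, which is the intuition behind localizing edits instead of rewriting the whole policy at once.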

Related papers

AdaEvolve: Adaptive LLM Driven Zeroth-Order Optimization [61.535567824938205]
We introduce AdaEvolve, a framework that reformulates LLM-driven evolution as a hierarchical adaptive optimization problem. AdaEvolve consistently outperforms the open-ended baselines across 185 different open-ended optimization problems.
arXiv Detail & Related papers (2026-02-23T18:45:31Z)
Gecko: A Simulation Environment with Stateful Feedback for Refining Agent Tool Calls [56.407063247662336]
We introduce Gecko, a comprehensive environment that simulates tool responses using a combination of rules and LLMs. GATS consistently improves the tool calling performance of various LLMs including GPT-4o, GPT-5, and Gemini-3.0-pro.
arXiv Detail & Related papers (2026-02-22T15:02:00Z)
Policy of Thoughts: Scaling LLM Reasoning via Test-time Policy Evolution [15.627651452629706]
Large language models (LLMs) struggle with complex, long-horizon reasoning due to their frozen assumption. Inspired by Popper's "conjectures and refutations," we argue that intelligence requires real-time evolution of the model's policy. We introduce a framework that recasts reasoning as a within-instance online optimization process.
arXiv Detail & Related papers (2026-01-28T08:44:34Z)
Sponge Tool Attack: Stealthy Denial-of-Efficiency against Tool-Augmented Agentic Reasoning [58.432996881401415]
Recent work augments large language models (LLMs) with external tools to enable agentic reasoning. We propose Sponge Tool Attack (STA), which disrupts agentic reasoning solely by rewriting the input prompt. STA generates benign-looking prompt rewrites from the original one with high semantic fidelity.
arXiv Detail & Related papers (2026-01-24T19:36:51Z)
EvoFSM: Controllable Self-Evolution for Deep Research with Finite State Machines [23.086761228480682]
EvoFSM is a structured self-evolving framework that achieves both adaptability and control by evolving an explicit Finite State Machine. EvoFSM refines the FSM through a small set of constrained operations, and further incorporates a self-evolving memory that distills successful trajectories as reusable priors and failure patterns. In particular, EvoFSM reaches 58.0% accuracy on the DeepSearch benchmark.
arXiv Detail & Related papers (2026-01-14T13:19:13Z)
EvoLattice: Persistent Internal-Population Evolution through Multi-Alternative Quality-Diversity Graph Representations for LLM-Guided Program Discovery [2.1756081703276]
EvoLattice is a framework that represents an entire population of candidate programs or agent behaviors within a single directed acyclic graph. Each node stores multiple persistent alternatives, and every valid path through the graph defines a distinct candidate. EvoLattice produces statistics that reveal how local design choices affect global performance.
arXiv Detail & Related papers (2025-12-15T19:43:06Z)
In-the-Flow Agentic System Optimization for Effective Planning and Tool Use [73.72524040856052]
AgentFlow is a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory. Flow-GRPO tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks.
arXiv Detail & Related papers (2025-10-07T05:32:44Z)
LLAMA: Multi-Feedback Smart Contract Fuzzing Framework with LLM-Guided Seed Generation [56.84049855266145]
We propose a Multi-feedback Smart Contract Fuzzing framework (LLAMA) that integrates evolutionary mutation strategies and hybrid testing techniques. LLAMA achieves 91% instruction coverage and 90% branch coverage, while detecting 132 out of 148 known vulnerabilities. These results highlight LLAMA's effectiveness, adaptability, and practicality in real-world smart contract security testing scenarios.
arXiv Detail & Related papers (2025-07-16T09:46:58Z)
Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use. MeCo quantifies metacognitive scores by capturing high-level cognitive signals in the representation space. MeCo is fine-tuning-free and incurs minimal cost.
arXiv Detail & Related papers (2025-02-18T15:45:01Z)

This list is automatically generated from the titles and abstracts of the papers on this site.