EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems
- URL: http://arxiv.org/abs/2510.13220v1
- Date: Wed, 15 Oct 2025 07:16:28 GMT
- Title: EvoTest: Evolutionary Test-Time Learning for Self-Improving Agentic Systems
- Authors: Yufei He, Juncheng Liu, Yue Liu, Yibo Li, Tri Cao, Zhiyuan Hu, Xinxing Xu, Bryan Hooi
- Abstract summary: A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time. We present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients, by evolving the entire agentic system after every episode.
- Score: 59.66823584073748
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A fundamental limitation of current AI agents is their inability to learn complex skills on the fly at test time, often behaving like "clever but clueless interns" in novel environments. This severely limits their practical utility. To systematically measure and drive progress on this challenge, we first introduce the Jericho Test-Time Learning (J-TTL) benchmark. J-TTL is a new evaluation setup where an agent must play the same game for several consecutive episodes, attempting to improve its performance from one episode to the next. On J-TTL, we find that existing adaptation methods like reflection, memory, or reinforcement learning struggle. To address the challenges posed by our benchmark, we present EvoTest, an evolutionary test-time learning framework that improves an agent without any fine-tuning or gradients, by evolving the entire agentic system after every episode. EvoTest has two roles: the Actor Agent, which plays the game, and the Evolver Agent, which analyzes the episode transcript to propose a revised configuration for the next run. This configuration rewrites the prompt, updates memory by logging effective state-action choices, tunes hyperparameters, and learns the tool-use routines. On our J-TTL benchmark, EvoTest consistently increases performance, outperforming not only reflection and memory-only baselines but also more complex online fine-tuning methods. Notably, our method is the only one capable of winning two games (Detective and Library), while all baselines fail to win any.
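The episode-level loop the abstract describes (the Actor plays, the Evolver rewrites the configuration, and the same game is replayed) can be sketched in a few dozen lines. The Python below is a minimal illustration under stated assumptions: the ToyGame environment, the actor_step heuristic, and every field name are hypothetical stand-ins, and the paper's actual Evolver is itself an LLM rather than the hand-coded update shown here.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Config:
    """Evolvable agent configuration; all field names are illustrative."""
    prompt: str = "Play the game; maximize score."
    memory: dict = field(default_factory=dict)   # obs -> action that earned reward
    temperature: float = 0.7                     # example tunable hyperparameter

class ToyGame:
    """Tiny stand-in for a Jericho game: one rewarded action per state."""
    actions = ["north", "south", "take lamp", "open door"]
    def reset(self):
        self.t = 0
        return "room-0"
    def step(self, action):
        reward = 1 if action == self.actions[self.t % 4] else 0
        self.t += 1
        return f"room-{self.t}", reward, self.t >= 8

def actor_step(config, obs, actions):
    """Actor Agent step: prefer a remembered good action, else explore.
    A real Actor would query an LLM with config.prompt and config.memory."""
    if obs in config.memory and random.random() > config.temperature:
        return config.memory[obs]
    return random.choice(actions)

def evolve(config, transcript):
    """Evolver Agent: analyze the transcript, propose a revised configuration.
    EvoTest's Evolver is itself an LLM that also rewrites the prompt and learns
    tool-use routines; only memory and hyperparameter updates are sketched."""
    new = Config(config.prompt, dict(config.memory),
                 max(0.1, config.temperature * 0.9))
    for obs, action, reward in transcript:
        if reward > 0:
            new.memory[obs] = action             # log effective state-action choices
    return new

def test_time_learning(env, episodes=10):
    """Play the same game repeatedly, evolving the whole config between episodes."""
    config, scores = Config(), []
    for _ in range(episodes):
        obs, transcript, total, done = env.reset(), [], 0, False
        while not done:
            action = actor_step(config, obs, env.actions)
            nxt, reward, done = env.step(action)
            transcript.append((obs, action, reward))
            total, obs = total + reward, nxt
        scores.append(total)
        config = evolve(config, transcript)      # no gradients, no fine-tuning
    return scores

print(test_time_learning(ToyGame()))
```

On this toy, scores typically rise across episodes because the evolved memory accumulates rewarded state-action pairs, which is the episode-over-episode improvement that J-TTL measures.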
Related papers
- Toward Training Superintelligent Software Agents through Self-Play SWE-RL [66.11447353341926]
Self-play SWE-RL is a first step toward training paradigms for superintelligent software agents. Our approach makes minimal data assumptions, only requiring access to sandboxed repositories with source code and installed dependencies. Our results, albeit early, suggest a path where agents autonomously gather extensive learning experiences from real-world software repositories.
arXiv Detail & Related papers (2025-12-21T00:49:40Z)
- SAGE: Semantic-Aware Gray-Box Game Regression Testing with Large Language Models [12.705802209782506]
SAGE is a semantic-aware regression testing framework for gray-box game environments. It addresses the core challenges of test generation, maintenance, and selection. It achieves superior bug detection with significantly lower execution cost, while demonstrating strong adaptability to version updates.
arXiv Detail & Related papers (2025-11-29T17:09:18Z)
- Towards Self-Evolving Benchmarks: Synthesizing Agent Trajectories via Test-Time Exploration under Validate-by-Reproduce Paradigm [60.36837655498119]
We propose TRACE, a Trajectory-based validated-by-Reproducing Agent-benchmark Complexity Evolution framework. This framework takes an original task from an existing benchmark and encourages agents to evolve it into a new task with higher difficulty. Experiments on the GAIA benchmark demonstrate that the TRACE framework consistently enhances task complexity while improving the reliability of correctness.
arXiv Detail & Related papers (2025-10-01T01:52:52Z)
- The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements [87.61432174951891]
A critical capability for scientific progress is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark. We find that recent reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark.
arXiv Detail & Related papers (2025-06-27T17:44:32Z)
- Better than Your Teacher: LLM Agents that learn from Privileged AI Feedback [12.61197377492141]
Large language models (LLMs) show impressive decision-making abilities.
Current methods lack a mechanism for automatic self-improvement from errors during task execution.
We propose LEAP, an iterative fine-tuning framework that continually improves LLM agents using feedback from AI expert teachers.
arXiv Detail & Related papers (2024-10-07T18:55:53Z)
- AutoPenBench: Benchmarking Generative Agents for Penetration Testing [42.681170697805726]
This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing.
We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack.
We show the benefits of AutoPenBench by testing two agent architectures: a fully autonomous one and a semi-autonomous one that supports human interaction.
arXiv Detail & Related papers (2024-10-04T08:24:15Z)
- Re-ReST: Reflection-Reinforced Self-Training for Language Agents [101.22559705696885]
Self-training in language agents can generate supervision from the agent itself. We present Reflection-Reinforced Self-Training (Re-ReST), which uses a reflector to refine low-quality generated samples.
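As an interface-level sketch of where such a reflector sits in the loop (all names below, agent.generate, reflector.refine, and is_good, are assumed placeholders rather than Re-ReST's actual API):

```python
def self_training_round(agent, reflector, tasks, is_good):
    """One self-training round with a reflector (interfaces are assumed
    placeholders, not Re-ReST's API): keep good self-generated samples and
    route failures through the reflector to recover extra supervision."""
    train_set = []
    for task in tasks:
        sample = agent.generate(task)             # the agent's own attempt
        if is_good(task, sample):                 # e.g. unit tests or a verifier
            train_set.append((task, sample))
            continue
        revised = reflector.refine(task, sample)  # repair the low-quality sample
        if is_good(task, revised):
            train_set.append((task, revised))     # supervision mined from failures
    agent.fine_tune(train_set)                    # standard self-training update
    return agent
```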
arXiv Detail & Related papers (2024-06-03T16:21:38Z)
- Solving Offline Reinforcement Learning with Decision Tree Regression [0.0]
This study presents a novel approach to addressing offline reinforcement learning problems by reframing them as regression tasks.
We introduce two distinct frameworks: return-conditioned and return-weighted decision tree policies.
Despite the simplification inherent in this reformulated approach to offline RL, our agents demonstrate performance that is at least on par with the established methods.
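The return-conditioned variant, as summarized above, amounts to ordinary supervised regression on (state, return-to-go) pairs. The scikit-learn sketch below illustrates that reduction; the data layout, the episodes index pairs, and the max_depth value are assumptions for illustration, not the paper's setup.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def returns_to_go(rewards, gamma=1.0):
    """Suffix sums R_t = r_t + gamma * R_{t+1} within a single episode."""
    rtg, acc = np.zeros(len(rewards)), 0.0
    for t in range(len(rewards) - 1, -1, -1):
        acc = rewards[t] + gamma * acc
        rtg[t] = acc
    return rtg

def fit_return_conditioned_policy(states, actions, rewards, episodes):
    """Regress actions on (state, return-to-go); a decision tree stands in for
    the usual sequence model. `episodes` lists (start, end) index pairs."""
    rtg = np.concatenate([returns_to_go(rewards[s:e]) for s, e in episodes])
    X = np.hstack([states, rtg[:, None]])         # condition on desired return
    return DecisionTreeRegressor(max_depth=8).fit(X, actions)

def act(tree, state, target_return):
    """At test time, request a return-to-go and read off the predicted action."""
    return tree.predict(np.concatenate([state, [target_return]])[None, :])[0]
```

A common usage pattern for return-conditioned policies generally is to request a high target return at the start of an episode and decrement it by the observed reward after each step.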
arXiv Detail & Related papers (2024-01-21T23:50:46Z)
- Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels [112.63440666617494]
Reinforcement learning algorithms can succeed but require large amounts of interaction between the agent and the environment.
We propose a new method to solve this benchmark, using unsupervised model-based RL to pre-train the agent.
We show robust performance on the Real-World RL benchmark, hinting at resiliency to environment perturbations during adaptation.
arXiv Detail & Related papers (2022-09-24T14:22:29Z)