Agent Learning via Early Experience
- URL: http://arxiv.org/abs/2510.08558v2
- Date: Mon, 13 Oct 2025 23:25:00 GMT
- Title: Agent Learning via Early Experience
- Authors: Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu,
- Abstract summary: A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks.<n>Most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly.<n>We study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making.
- Score: 93.83579011718858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
Related papers
- Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation [57.65688895630163]
We introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data.<n>Our method effectively enables both intra-environment and cross-environment continual learning, yielding 4-22% performance gains without forgetting existing environments.
arXiv Detail & Related papers (2026-02-10T23:06:02Z) - Self-Consolidation for Self-Evolving Agents [51.94826934403236]
Large language model (LLM) agents operate as static systems, lacking the ability to evolve through lifelong interaction.<n>We propose a novel self-evolving framework for LLM agents that introduces a complementary evolution mechanism.
arXiv Detail & Related papers (2026-02-02T11:16:07Z) - Large Language Model Agents Are Not Always Faithful Self-Evolvers [84.08646612111092]
Self-evolving large language model (LLM) agents continually improve by accumulating and reusing past experience.<n>We present the first systematic investigation of experience faithfulness, the causal dependence of an agent's decisions on the experience it is given.
arXiv Detail & Related papers (2026-01-30T01:05:15Z) - What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding [50.35012849818872]
Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision-making and tool-use tasks.<n>We propose Task-to-Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world-state understanding.<n>Our experiments reveal that task success is often a poor proxy for environment understanding, and that current memory machanism can not effectively help agents acquire a grounded model of the environment.
arXiv Detail & Related papers (2026-01-14T14:09:11Z) - From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process.<n>We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
arXiv Detail & Related papers (2024-11-06T10:35:11Z) - OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization [66.22117723598872]
We introduce an open-source framework designed to facilitate the development of multimodal web agent.
We first train the base model with imitation learning to gain the basic abilities.
We then let the agent explore the open web and collect feedback on its trajectories.
arXiv Detail & Related papers (2024-10-25T15:01:27Z) - "Give Me an Example Like This": Episodic Active Reinforcement Learning from Demonstrations [3.637365301757111]
Methods like Reinforcement Learning from Expert Demonstrations (RLED) introduce external expert demonstrations to facilitate agent exploration during the learning process.
How to select the best set of human demonstrations that is most beneficial for learning becomes a major concern.
This paper presents EARLY, an algorithm that enables a learning agent to generate optimized queries of expert demonstrations in a trajectory-based feature space.
arXiv Detail & Related papers (2024-06-05T08:52:21Z) - Maximum diffusion reinforcement learning [7.334017970483869]
Correlations create fundamental challenges for machine learning.
In reinforcement learning, where data are directly collected from an agent's sequential experiences, violations of this assumption are often unavoidable.
By decorrelating agent experiences, our approach provably enables single-shot learning in continuous deployments.
arXiv Detail & Related papers (2023-09-26T22:14:56Z) - ExpeL: LLM Agents Are Experiential Learners [57.13685954854463]
We introduce the Experiential Learning (ExpeL) agent to allow learning from agent experiences without requiring parametric updates.<n>Our agent autonomously gathers experiences and extracts knowledge using natural language from a collection of training tasks.<n>At inference, the agent recalls its extracted insights and past experiences to make informed decisions.
arXiv Detail & Related papers (2023-08-20T03:03:34Z) - Asking Before Acting: Gather Information in Embodied Decision Making with Language Models [20.282749796376063]
We show that Large Language Models (LLMs) encounter challenges in efficiently gathering essential information in unfamiliar environments.
We propose textitAsking Before Acting (ABA), a method that empowers the agent to proactively inquire with external sources for pertinent information using natural language.
We conduct extensive experiments involving a spectrum of environments including text-based household everyday tasks, robot arm manipulation tasks, and real world open domain image based embodied tasks.
arXiv Detail & Related papers (2023-05-25T04:05:08Z) - Co-Imitation Learning without Expert Demonstration [39.988945772085465]
We propose a novel learning framework called Co-Imitation Learning (CoIL) to exploit the past good experiences of the agents without expert demonstration.
While the experiences could be valuable or misleading, we propose to estimate the potential utility of each piece of experience with the expected gain of the value function.
Experimental results on various tasks show significant superiority of the proposed Co-Imitation Learning framework.
arXiv Detail & Related papers (2021-03-27T06:58:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.