The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios
- URL: http://arxiv.org/abs/2601.08173v1
- Date: Tue, 13 Jan 2026 03:09:18 GMT
- Title: The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios
- Authors: Daocheng Fu, Jianbiao Mei, Rong Wu, Xuemeng Yang, Jia Xu, Ding Wang, Pinlong Cai, Yong Liu, Licheng Wen, Botian Shi,
- Abstract summary: We introduce method, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting.<n>Unlike traditional benchmarks, method evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks.<n>Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios.
- Score: 34.25281365374991
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking robustness for stochastic real-world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. To bridge this gap, we introduce \method{}, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. Unlike traditional benchmarks, \method{} evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks. Experiments show that cutting-edge agents have significant deficiencies in dynamic environments, especially in active exploration and continual learning. Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios. Our codes are available at https://github.com/KnowledgeXLab/EvoEnv
Related papers
- Automatic Cognitive Task Generation for In-Situ Evaluation of Embodied Agents [43.01384379901339]
We propose a dynamic in-situ task generation method for unseen environments inspired by human cognition.<n>In the interaction stage, the agent actively interacts with the environment, creating a loop between task execution and generation.<n>Experiments across 10 unseen scenes demonstrate that TEA automatically generated 87,876 tasks in two cycles.
arXiv Detail & Related papers (2026-02-05T03:07:00Z) - What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding [50.35012849818872]
Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision-making and tool-use tasks.<n>We propose Task-to-Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world-state understanding.<n>Our experiments reveal that task success is often a poor proxy for environment understanding, and that current memory machanism can not effectively help agents acquire a grounded model of the environment.
arXiv Detail & Related papers (2026-01-14T14:09:11Z) - CuES: A Curiosity-driven and Environment-grounded Synthesis Framework for Agentic RL [35.086788669916594]
Large language model based agents are increasingly deployed in complex, tool augmented environments.<n>Existing approaches typically assume predefined task collections, an assumption that fails in novel environments.<n>We propose CuES, a Curiosity driven and Environment grounded Synthesis framework that autonomously generates diverse, executable, and meaningful tasks.
arXiv Detail & Related papers (2025-12-01T06:11:37Z) - Scaling Environments for LLM Agents in the Era of Learning from Interaction: A Survey [30.673419015614233]
A growing consensus is that agents should interact directly with environments and learn from experience through reinforcement learning.<n>We formalize this iterative process as the Generation-Execution-Feedback (GEF) loop, where environments generate tasks to challenge agents, return observations in response to agents' actions during task execution, and provide evaluative feedback on rollouts for subsequent learning.<n>Under this paradigm, environments function as indispensable producers of experiential data, highlighting the need to scale them toward greater complexity, realism, and interactivity.
arXiv Detail & Related papers (2025-11-12T12:56:25Z) - Grounded Test-Time Adaptation for LLM Agents [75.62784644919803]
Large language model (LLM)-based agents struggle to generalize to novel and complex environments.<n>We propose two strategies for adapting LLM agents by leveraging environment-specific information available during deployment.
arXiv Detail & Related papers (2025-11-06T22:24:35Z) - Agent Learning via Early Experience [93.83579011718858]
A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks.<n>Most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly.<n>We study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making.
arXiv Detail & Related papers (2025-10-09T17:59:17Z) - Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark [57.59000694149105]
We introduce Experience-driven Lifelong Learning (ELL), a framework for building self-evolving agents.<n>ELL is built on four core principles: Experience Exploration, Long-term Memory, Skill Learning and Knowledge Internalization.<n>We also introduce StuLife, a benchmark dataset for ELL that simulates a student's holistic college journey.
arXiv Detail & Related papers (2025-08-26T13:04:28Z) - No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery [53.08822154199948]
Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula promise to enable agents to be robust to in- and out-of-distribution tasks.
This work investigates how existing UED methods select training environments, focusing on task prioritisation metrics.
We develop a method that directly trains on scenarios with high learnability.
arXiv Detail & Related papers (2024-08-27T14:31:54Z) - A Practitioner's Guide to Continual Multimodal Pretraining [83.63894495064855]
Multimodal foundation models serve numerous applications at the intersection of vision and language.<n>To keep models updated, research into continual pretraining mainly explores scenarios with either infrequent, indiscriminate updates on large-scale new data, or frequent, sample-level updates.<n>We introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements.
arXiv Detail & Related papers (2024-08-26T17:59:01Z) - Online Continual Learning For Interactive Instruction Following Agents [20.100312650193228]
We argue that such a learning scenario is less realistic since a robotic agent is supposed to learn the world continuously as it explores and perceives it.
We propose two continual learning setups for embodied agents; learning new behaviors and new environments.
arXiv Detail & Related papers (2024-03-12T11:33:48Z) - Dynamic Regret Analysis for Online Meta-Learning [0.0]
The online meta-learning framework has arisen as a powerful tool for the continual lifelong learning setting.
This formulation involves two levels: outer level which learns meta-learners and inner level which learns task-specific models.
We establish performance in terms of dynamic regret which handles changing environments from a global prospective.
We carry out our analyses in a setting, and in expectation prove a logarithmic local dynamic regret which explicitly depends on the total number of iterations.
arXiv Detail & Related papers (2021-09-29T12:12:59Z) - Variational Dynamic for Self-Supervised Exploration in Deep Reinforcement Learning [12.76337275628074]
In this work, we propose a variational dynamic model based on the conditional variational inference to model the multimodality andgenerativeity.
We derive an upper bound of the negative log-likelihood of the environmental transition and use such an upper bound as the intrinsic reward for exploration.
Our method outperforms several state-of-the-art environment model-based exploration approaches.
arXiv Detail & Related papers (2020-10-17T09:54:51Z) - Meta Reinforcement Learning with Autonomous Inference of Subtask
Dependencies [57.27944046925876]
We propose and address a novel few-shot RL problem, where a task is characterized by a subtask graph.
Instead of directly learning a meta-policy, we develop a Meta-learner with Subtask Graph Inference.
Our experiment results on two grid-world domains and StarCraft II environments show that the proposed method is able to accurately infer the latent task parameter.
arXiv Detail & Related papers (2020-01-01T17:34:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.