Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments
- URL: http://arxiv.org/abs/2602.11964v1
- Date: Thu, 12 Feb 2026 13:58:27 GMT
- Title: Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments
- Authors: Romain Froger, Pierre Andrews, Matteo Bettini, Amar Budhiraja, Ricardo Silveira Cabral, Virginie Do, Emilien Garreau, Jean-Baptiste Gaya, Hugo Laurençon, Maxime Lecanu, Kunal Malkan, Dheeraj Mekala, Pierre Ménard, Gerard Moreno-Torres Bertran, Ulyana Piterbarg, Mikhail Plekhanov, Mathieu Rita, Andrey Rusakov, Vladislav Vorotilov, Mengjue Wang, Ian Yu, Amine Benhalloum, Grégoire Mialon, Thomas Scialom
- Abstract summary: We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Gaia2 introduces scenarios where environments evolve independently of agent actions. Gaia2 is built on a consumer environment with the open-source Agents Research Environments platform.
- Score: 22.98982051873728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Gaia2, a benchmark for evaluating large language model agents in realistic, asynchronous environments. Unlike prior static or synchronous evaluations, Gaia2 introduces scenarios where environments evolve independently of agent actions, requiring agents to operate under temporal constraints, adapt to noisy and dynamic events, resolve ambiguity, and collaborate with other agents. Each scenario is paired with a write-action verifier, enabling fine-grained, action-level evaluation and making Gaia2 directly usable for reinforcement learning from verifiable rewards. Our evaluation of state-of-the-art proprietary and open-source models shows that no model dominates across capabilities: GPT-5 (high) reaches the strongest overall score of 42% pass@1 but fails on time-sensitive tasks, Claude-4 Sonnet trades accuracy and speed for cost, Kimi-K2 leads among open-source models with 21% pass@1. These results highlight fundamental trade-offs between reasoning, efficiency, robustness, and expose challenges in closing the "sim2real" gap. Gaia2 is built on a consumer environment with the open-source Agents Research Environments platform and designed to be easy to extend. By releasing Gaia2 alongside the foundational ARE framework, we aim to provide the community with a flexible infrastructure for developing, benchmarking, and training the next generation of practical agent systems.
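The abstract reports results as pass@1. As a minimal sketch of how such a score can be computed, the snippet below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), averaged over scenarios. The function names and the per-scenario counts are hypothetical and purely illustrative; they are not taken from the Gaia2 or ARE code, and the paper's own scoring pipeline may differ.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one scenario: n sampled runs, c of them successful."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k runs failed, every size-k subset contains a success.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over scenarios, each given as (n_runs, n_successes)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data: three scenarios, three runs each (not real Gaia2 results).
example_results = [(3, 3), (3, 1), (3, 0)]
example_score = benchmark_pass_at_k(example_results, k=1)
print(f"pass@1 = {example_score:.3f}")
```

For k = 1 the estimator reduces to the fraction of successful runs per scenario, so the benchmark score is the mean per-scenario success rate.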
Related papers
- Autonomous Continual Learning of Computer-Use Agents for Environment Adaptation [57.65688895630163]
We introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. Our method effectively enables both intra-environment and cross-environment continual learning, yielding 4-22% performance gains without forgetting existing environments.
arXiv Detail & Related papers (2026-02-10T23:06:02Z) - What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding [50.35012849818872]
Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision-making and tool-use tasks. We propose Task-to-Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world-state understanding. Our experiments reveal that task success is often a poor proxy for environment understanding, and that current memory mechanisms cannot effectively help agents acquire a grounded model of the environment.
arXiv Detail & Related papers (2026-01-14T14:09:11Z) - Let It Flow: Agentic Crafting on Rock and Roll, Building the ROME Model within an Open Agentic Learning Ecosystem [90.17610617854247]
We introduce the Agentic Learning Ecosystem (ALE), a foundational infrastructure that optimizes the production pipeline for agentic models. ALE consists of three components: ROLL, a post-training framework for weight optimization; ROCK, a sandbox environment manager for trajectory generation; and iFlow CLI, an agent framework for efficient context engineering. We release ROME, an open-source agent grounded by ALE and trained on over one million trajectories.
arXiv Detail & Related papers (2025-12-31T14:03:39Z) - Grounded Test-Time Adaptation for LLM Agents [75.62784644919803]
Large language model (LLM)-based agents struggle to generalize to novel and complex environments. We propose two strategies for adapting LLM agents by leveraging environment-specific information available during deployment.
arXiv Detail & Related papers (2025-11-06T22:24:35Z) - ARE: Scaling Up Agent Environments and Evaluations [22.98982051873728]
We introduce Meta Agents Research Environments (ARE), a research platform for scalable creation of environments. ARE provides simple abstractions to build complex and diverse environments, each with their own rules, tools, content, and verifiers. We also propose Gaia2, a benchmark built in ARE and designed to measure general agent capabilities.
arXiv Detail & Related papers (2025-09-21T16:59:45Z) - Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection [108.5042835056188]
This work introduces Agent4FaceForgery to address two fundamental problems: how to capture the diverse intents and iterative processes of human forgery creation, and how to model the complex, often adversarial, text-image interactions that accompany forgeries in social media.
arXiv Detail & Related papers (2025-09-16T01:05:01Z) - $τ^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment [32.345011712015435]
Existing benchmarks for AI agents simulate single-control environments. We introduce $τ^2$-bench, where both agent and user make use of tools to act in a shared, dynamic environment. In particular, our experiments show significant performance drops when agents shift from no-user to dual-control.
arXiv Detail & Related papers (2025-06-09T17:52:18Z) - GAIA: A Foundation Model for Operational Atmospheric Dynamics [0.83442357861662]
We introduce GAIA, a hybrid self-supervised model that fuses Masked Autoencoders (MAE) with self-distillation with no labels (DINO). GAIA learns disentangled representations that capture atmospheric dynamics rather than trivial diurnal patterns. When transferred to downstream tasks, GAIA consistently outperforms an MAE-only baseline.
arXiv Detail & Related papers (2025-05-15T05:07:09Z) - R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents [32.06393076572057]
AgentGym is the largest procedurally-curated executable gym environment for training real-world SWE-agents. It is powered by two main contributions: SYNGEN, a synthetic data curation recipe, and Hybrid Test-time Scaling. Overall, our approach achieves 51% on the SWE-Bench Verified benchmark, reflecting a new state-of-the-art for open-weight SWE-agents.
arXiv Detail & Related papers (2025-04-09T17:55:19Z) - The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature. We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.