Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions
- URL: http://arxiv.org/abs/2408.02544v3
- Date: Fri, 05 Sep 2025 09:21:10 GMT
- Title: Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions
- Authors: Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao
- Abstract summary: This paper investigates the faithfulness of multimodal large language model (MLLM) agents in a graphical user interface (GUI) environment. A general scenario is proposed where both the user and the agent are benign, and the environment, while not malicious, contains unrelated content. Experimental results reveal that even the most powerful models, whether generalist agents or specialist GUI agents, are susceptible to distractions.
- Score: 50.5976989558411
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates the faithfulness of multimodal large language model (MLLM) agents in a graphical user interface (GUI) environment, aiming to address the research question of whether multimodal GUI agents can be distracted by environmental context. A general scenario is proposed where both the user and the agent are benign, and the environment, while not malicious, contains unrelated content. A wide range of MLLMs are evaluated as GUI agents using a simulated dataset, following three working patterns with different levels of perception. Experimental results reveal that even the most powerful models, whether generalist agents or specialist GUI agents, are susceptible to distractions. While recent studies predominantly focus on the helpfulness of agents, our findings are the first to indicate that these agents are prone to environmental distractions. Furthermore, we implement an adversarial environment injection and analyze approaches to improve faithfulness, calling for a collective focus on this important topic.
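As a rough illustration of how faithfulness under distraction might be scored in a GUI setting, the sketch below classifies a predicted click as faithful, distracted, or invalid against axis-aligned bounding boxes. This is a minimal, assumption-laden sketch, not the paper's released evaluation code; all names and box formats are illustrative.

```python
# Hedged sketch: labeling a GUI agent's predicted action against the
# intended target region and any distractor regions on screen.
# Boxes are (x0, y0, x1, y1) axis-aligned rectangles (an assumption).

def in_box(point, box):
    """Return True if (x, y) lies inside box = (x0, y0, x1, y1)."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def classify_action(click, gold_box, distractor_boxes):
    """Label a predicted click as faithful, distracted, or invalid."""
    if in_box(click, gold_box):
        return "faithful"
    if any(in_box(click, b) for b in distractor_boxes):
        return "distracted"
    return "invalid"

# Example: an unrelated pop-up dialog acting as the distractor.
gold = (0, 0, 100, 40)          # intended "Submit" button (illustrative)
popups = [(50, 200, 300, 320)]  # unrelated dialog region
print(classify_action((60, 250), gold, popups))  # -> distracted
```

Aggregating these labels over a task set would then give separate faithfulness and distraction rates rather than a single success metric.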
Related papers
- Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning [62.499592503950026]
Large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. We propose Agent World Model (AWM), a fully synthetic environment generation pipeline. We scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets.
arXiv Detail & Related papers (2026-02-10T18:55:41Z) - AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts [78.33143446024485]
We introduce AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios.
arXiv Detail & Related papers (2026-01-28T16:05:44Z) - What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding [50.35012849818872]
Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision-making and tool-use tasks. We propose Task-to-Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world-state understanding. Our experiments reveal that task success is often a poor proxy for environment understanding, and that current memory mechanisms cannot effectively help agents acquire a grounded model of the environment.
arXiv Detail & Related papers (2026-01-14T14:09:11Z) - Scaling Environments for LLM Agents in the Era of Learning from Interaction: A Survey [30.673419015614233]
A growing consensus is that agents should interact directly with environments and learn from experience through reinforcement learning. We formalize this iterative process as the Generation-Execution-Feedback (GEF) loop, where environments generate tasks to challenge agents, return observations in response to agents' actions during task execution, and provide evaluative feedback on rollouts for subsequent learning. Under this paradigm, environments function as indispensable producers of experiential data, highlighting the need to scale them toward greater complexity, realism, and interactivity.
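The Generation-Execution-Feedback loop described above can be sketched as a minimal Python skeleton. The environment, policy, and feedback signal here are toy stand-ins; all names are illustrative, not from the survey.

```python
# Hedged sketch of the Generation-Execution-Feedback (GEF) loop:
# the environment generates a task, the agent executes it step by step
# against returned observations, and a feedback signal scores the rollout.
import random

def generate_task(rng):
    """Generation: the environment proposes a task (reach a target state)."""
    return {"target": rng.randint(0, 10)}

def execute(task, policy, max_steps=20):
    """Execution: the agent acts from observations; env transitions state."""
    state = 0
    for _ in range(max_steps):
        if state == task["target"]:
            return state, True
        action = policy(state, task)  # agent decides from the observation
        state += action               # environment transition
    return state, state == task["target"]

def feedback(success):
    """Feedback: evaluative signal on the rollout for subsequent learning."""
    return 1.0 if success else 0.0

rng = random.Random(0)
policy = lambda state, task: 1 if state < task["target"] else -1
task = generate_task(rng)
_, ok = execute(task, policy)
print(feedback(ok))  # -> 1.0
```

In a real GEF pipeline the feedback would drive a reinforcement-learning update; here it simply closes the loop.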
arXiv Detail & Related papers (2025-11-12T12:56:25Z) - A Survey on Agentic Multimodal Large Language Models [84.18778056010629]
We present a comprehensive survey on Agentic Multimodal Large Language Models (Agentic MLLMs). We explore the emerging paradigm of agentic MLLMs, delineating their conceptual foundations and distinguishing characteristics from conventional MLLM-based agents. To further accelerate research in this area for the community, we compile open-source training frameworks, as well as training and evaluation datasets, for developing agentic MLLMs.
arXiv Detail & Related papers (2025-10-13T04:07:01Z) - The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation [0.16921396880325779]
We introduce a novel evaluation framework that uses multi-agent debate as a controlled "social laboratory". We show that assigned personas induce stable, measurable psychometric profiles, particularly in cognitive effort. This work provides a blueprint for a new class of dynamic, psychometrically grounded evaluation protocols.
arXiv Detail & Related papers (2025-10-01T07:10:28Z) - Evolutionary Perspectives on the Evaluation of LLM-Based AI Agents: A Comprehensive Survey [45.485318955120924]
The transition from traditional large language models (LLMs) to more advanced AI agents represents a pivotal evolutionary step. Existing evaluation frameworks often blur the distinctions between LLM chatbots and AI agents, leading to confusion among researchers selecting appropriate benchmarks. This paper introduces a systematic analysis of current evaluation approaches, grounded in an evolutionary perspective.
arXiv Detail & Related papers (2025-06-06T17:52:18Z) - Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks [94.19506319646376]
We introduce Agent-X, a benchmark for evaluating vision-centric agents in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. Our results reveal that even the best-performing models, including the GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks.
arXiv Detail & Related papers (2025-05-30T17:59:53Z) - $C^3$-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking [12.218102495632937]
We present an open-source benchmark, $C^3$-Bench, to assess agent robustness. Concretely, we design three challenges: navigating complex tool relationships, handling critical hidden information, and managing dynamic decision paths. In essence, $C^3$-Bench aims to expose model vulnerabilities through these challenges and drive research into the interpretability of agent performance.
arXiv Detail & Related papers (2025-05-24T15:25:44Z) - MAFE: Multi-Agent Fair Environments for Decision-Making Systems [30.91792275900066]
We introduce the concept of a Multi-Agent Fair Environment (MAFE) and present and analyze three MAFEs that model distinct social systems.
Experimental results demonstrate the utility of our MAFEs as testbeds for developing multi-agent fair algorithms.
arXiv Detail & Related papers (2025-02-25T04:03:50Z) - AgentAlign: Misalignment-Adapted Multi-Agent Perception for Resilient Inter-Agent Sensor Correlations [8.916036880001734]
Existing research overlooks the fragile multi-sensor correlations in multi-agent settings.
AgentAlign is a cross-modality feature alignment framework for real-world heterogeneous agents.
We present a novel V2XSet-noise dataset that simulates realistic sensor imperfections under diverse environmental conditions.
arXiv Detail & Related papers (2024-12-09T01:51:18Z) - MageBench: Bridging Large Multimodal Models to Agents [90.59091431806793]
LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents.
Existing benchmarks mostly assess their reasoning abilities on the language side.
MageBench is a multimodal agent benchmark oriented toward reasoning capability.
arXiv Detail & Related papers (2024-12-05T17:08:19Z) - R-AIF: Solving Sparse-Reward Robotic Tasks from Pixels with Active Inference and World Models [50.19174067263255]
We introduce prior preference learning techniques and self-revision schedules to help the agent excel in sparse-reward, continuous action, goal-based robotic control POMDP environments.
We show that our agents offer improved performance over state-of-the-art models in terms of cumulative rewards, relative stability, and success rate.
arXiv Detail & Related papers (2024-09-21T18:32:44Z) - HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments [93.94020724735199]
HAZARD consists of three unexpected disaster scenarios, including fire, flood, and wind.
This benchmark enables us to evaluate autonomous agents' decision-making capabilities across various pipelines.
arXiv Detail & Related papers (2024-01-23T18:59:43Z) - Agent AI: Surveying the Horizons of Multimodal Interaction [83.18367129924997]
"Agent AI" is a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data.
We envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
arXiv Detail & Related papers (2024-01-07T19:11:18Z) - INTAGS: Interactive Agent-Guided Simulation [4.04638613278729]
In many applications involving multi-agent systems (MAS), it is imperative to test an experimental (Exp) autonomous agent in a high-fidelity simulator prior to its deployment to production.
We propose a metric to distinguish between real and synthetic multi-agent systems, evaluated through live interaction between the Exp and background (BG) agents.
We show that using INTAGS to calibrate the simulator can generate more realistic market data compared to the state-of-the-art conditional Wasserstein Generative Adversarial Network approach.
arXiv Detail & Related papers (2023-09-04T19:56:18Z) - AgentBench: Evaluating LLMs as Agents [88.45506148281379]
Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks.
We present AgentBench, a benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z) - Semantic Tracklets: An Object-Centric Representation for Visual Multi-Agent Reinforcement Learning [126.57680291438128]
We study whether scalability can be achieved via a disentangled representation.
We evaluate semantic tracklets on the visual multi-agent particle environment (VMPE) and on the challenging visual multi-agent GFootball environment.
Notably, this method is the first to successfully learn a strategy for five players in the GFootball environment using only visual data.
arXiv Detail & Related papers (2021-08-06T22:19:09Z) - Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design [121.73425076217471]
We propose Unsupervised Environment Design (UED), where developers provide environments with unknown parameters, and these parameters are used to automatically produce a distribution over valid, solvable environments.
We call our technique Protagonist Antagonist Induced Regret Environment Design (PAIRED).
Our experiments demonstrate that PAIRED produces a natural curriculum of increasingly complex environments, and PAIRED agents achieve higher zero-shot transfer performance when tested in highly novel environments.
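As a rough illustration, the regret objective at the core of PAIRED can be sketched as follows. Returns here are plain floats standing in for full RL rollout returns, and all names are illustrative rather than taken from the paper's code.

```python
# Hedged sketch of PAIRED's regret signal: the environment designer is
# scored by the gap between an antagonist agent's return and the
# protagonist's return on the same environment. Maximizing this gap
# favors environments that are solvable (the antagonist succeeds) yet
# hard for the protagonist -- yielding a natural curriculum.

def paired_regret(antagonist_return, protagonist_return):
    """Approximate regret: how much better the antagonist performed."""
    return antagonist_return - protagonist_return

def pick_hardest(env_returns):
    """Designer's choice: the candidate environment with maximal regret."""
    return max(env_returns,
               key=lambda e: paired_regret(e["antagonist"], e["protagonist"]))

candidates = [
    {"name": "easy maze", "antagonist": 0.5, "protagonist": 0.5},
    {"name": "hard maze", "antagonist": 1.0, "protagonist": 0.25},
]
print(pick_hardest(candidates)["name"])  # -> hard maze
```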
arXiv Detail & Related papers (2020-12-03T17:37:01Z) - Heterogeneous Multi-Agent Reinforcement Learning for Unknown Environment Mapping [0.0]
We present an actor-critic algorithm that allows a team of heterogeneous agents to learn decentralized control policies for covering an unknown environment.
This task is of interest to national security and emergency response organizations that would like to enhance situational awareness in hazardous areas by deploying teams of unmanned aerial vehicles.
arXiv Detail & Related papers (2020-10-06T12:23:05Z) - Relational-Grid-World: A Novel Relational Reasoning Environment and An Agent Model for Relational Information Extraction [0.0]
Reinforcement learning (RL) agents are often designed specifically for a particular problem and they generally have uninterpretable working processes.
Statistical methods-based RL algorithms can be improved in terms of generalizability and interpretability using symbolic Artificial Intelligence (AI) tools such as logic programming.
We present a model-free RL architecture that is supported with explicit relational representations of the environmental objects.
arXiv Detail & Related papers (2020-07-12T11:30:48Z) - Diagnosing the Environment Bias in Vision-and-Language Navigation [102.02103792590076]
Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations.
Recent works that study VLN observe a significant performance drop when tested on unseen environments, indicating that the neural agent models are highly biased towards training environments.
In this work, we design novel diagnosis experiments via environment re-splitting and feature replacement, looking into possible reasons for this environment bias.
arXiv Detail & Related papers (2020-05-06T19:24:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.