Related papers: R-WoM: Retrieval-augmented World Model For Computer-use Agents

URL: http://arxiv.org/abs/2510.11892v1
Date: Mon, 13 Oct 2025 19:52:04 GMT
Title: R-WoM: Retrieval-augmented World Model For Computer-use Agents
Authors: Kai Mei, Jiang Guo, Shuaichen Chang, Mingwen Dong, Dongkyu Lee, Xing Niu, Jiarong Jiang,
Abstract summary: Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments. We probe two core capabilities of world models--future state prediction and reward estimation--through three tasks. We propose the Retrieval-augmented World Model (R-WoM), which grounds simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials.
Score: 15.812606459788471
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLMs' tendency toward hallucination and their reliance on static training knowledge, which can lead to compounding errors that inhibit long-horizon simulations. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models--future state prediction and reward estimation--through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance rapidly degrades in full-procedure planning. This highlights LLMs' limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves substantial improvements of up to 25.3% (OSWorld) and 18.1% (WebArena) compared to baselines, with particular advantages in longer-horizon simulations.

Related papers

Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z)
Reinforcement World Model Learning for LLM-based Agents [60.65003139516272]
Reinforcement World Model Learning (RWML) is a self-conditioned method that learns action-supervised world models for LLM-based agents. Our method aligns simulated next states produced by the model with realized next states observed from the environment. We evaluate our method on ALFWorld and $\tau^2$ Bench and observe significant gains over the base model, despite being entirely self-supervised.
arXiv Detail & Related papers (2026-02-05T16:30:08Z)
Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs [72.08224879435762]
Learn-to-Ask is a simulator-free framework for learning and deploying proactive dialogue agents. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service.
arXiv Detail & Related papers (2025-10-29T12:08:07Z)
SimuRA: Towards General Goal-Oriented Agent via Simulative Reasoning Architecture with LLM-Based World Model [88.04128601981145]
We introduce SimuRA, a goal-oriented architecture for generalized agentic reasoning. SimuRA overcomes the limitations of autoregressive reasoning by introducing a world model for planning via simulation. World-model-based planning, in particular, shows a consistent advantage of up to 124% over autoregressive planning.
arXiv Detail & Related papers (2025-07-31T17:57:20Z)
Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation [54.3628937181904]
Internal world models (WMs) enable agents to understand the world's state and predict transitions. Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o and Gemini, exhibit potential as general-purpose WMs.
arXiv Detail & Related papers (2025-06-27T03:24:29Z)
World Modelling Improves Language Model Agents [11.081954466884392]
DyMo is a method that augments large language models with a state prediction capability alongside function calling during post-training. On the Berkeley Function Calling Leaderboard V2, DyMo improves success rates and significantly reduces hallucinations.
arXiv Detail & Related papers (2025-06-03T14:20:59Z)
Deep Active Inference Agents for Delayed and Long-Horizon Environments [1.693200946453174]
AIF agents rely on accurate immediate predictions and exhaustive planning, a limitation that is exacerbated in delayed environments. We propose a generative-policy architecture featuring a multi-step latent transition that lets the generative model predict an entire horizon in a single look-ahead. We evaluate our agent in an environment that mimics a realistic industrial scenario with delayed and long-horizon settings.
arXiv Detail & Related papers (2025-05-26T11:50:22Z)
Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data [46.65903742010956]
We present the first large-scale quantitative evaluation of state-of-the-art LLMs' ability to accurately simulate human behavior. Using real-world data from 31,865 online shopping sessions containing 230,965 user actions, our evaluation reveals that prompt-based LLMs achieve only 11.86% accuracy in generating human actions. We also showcase that strategies as simple as fine-tuning LLMs on real human click-through data augmented with synthesized reasoning traces can greatly enhance models' performance.
arXiv Detail & Related papers (2025-03-26T17:33:27Z)
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation [25.26545170310844]
We present a World-model-augmented (WMA) web agent, which simulates the outcomes of its actions for better decision-making. Experiments on WebArena and Mind2Web show that our world models improve agents' policy selection without training.
arXiv Detail & Related papers (2024-10-17T05:37:00Z)
From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems [59.40480894948944]
Large language model (LLM) empowered agents are able to solve decision-making problems in the physical world. Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting. We prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning.
arXiv Detail & Related papers (2024-05-30T09:42:54Z)
EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios. We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning. We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.