Self-Improving World Modelling with Latent Actions
- URL: http://arxiv.org/abs/2602.06130v1
- Date: Thu, 05 Feb 2026 19:04:41 GMT
- Title: Self-Improving World Modelling with Latent Actions
- Authors: Yifu Qiu, Zheng Zhao, Waylon Li, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti
- Abstract summary: Internal modelling of the world is essential to reasoning and planning. We propose SWIRL, a self-improvement framework that learns from state-only sequences.
- Score: 53.93276450137471
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Internal modelling of the world -- predicting transitions between previous states $X$ and next states $Y$ under actions $Z$ -- is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as a latent variable and alternating between Forward World Modelling (FWM) $P_\theta(Y|X,Z)$ and Inverse Dynamics Modelling (IDM) $Q_\phi(Z|X,Y)$. SWIRL iterates between two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with latent actions given prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO), using the frozen opposite model's log-probability as the reward signal. We provide theoretical learnability guarantees for both updates, and evaluate SWIRL on LLMs and VLMs across multiple environments: single-turn and multi-turn open-world visual dynamics, and synthetic textual environments for physics, web, and tool calling. SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.
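To make the alternation concrete, here is a minimal, illustrative sketch of the coordinate ascent in Python. It uses toy tabular stand-ins for the FWM $P_\theta(Y|X,Z)$ and the IDM $Q_\phi(Z|X,Y)$, plus a simplified GRPO-style group-normalised policy update; the class names, the discrete toy environment, and the update rule are assumptions for illustration only, not the paper's LLM/VLM implementation.

```python
# Minimal sketch of SWIRL-style coordinate ascent (illustrative assumptions
# throughout: tabular models, toy discrete spaces, simplified GRPO update).
import math
import random
from collections import defaultdict

STATES = ["s0", "s1", "s2"]   # hypothetical toy state space
ACTIONS = ["a0", "a1"]        # hypothetical latent action space

class Tabular:
    """Softmax-over-logit-table stand-in for an LLM policy."""
    def __init__(self, outputs):
        self.outputs = outputs
        self.logits = defaultdict(float)  # (context, output) -> logit

    def probs(self, ctx):
        weights = [math.exp(self.logits[(ctx, o)]) for o in self.outputs]
        total = sum(weights)
        return [w / total for w in weights]

    def sample(self, ctx):
        return random.choices(self.outputs, weights=self.probs(ctx))[0]

    def logprob(self, ctx, out):
        return math.log(self.probs(ctx)[self.outputs.index(out)])

    def reinforce(self, ctx, outs, advantages, lr=0.5):
        # Policy-gradient-flavoured update: raise logits of high-advantage samples.
        for out, adv in zip(outs, advantages):
            self.logits[(ctx, out)] += lr * adv

def group_advantages(rewards):
    """GRPO-style normalisation: subtract the group mean, divide by the std."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mean) / std for r in rewards]

fwm = Tabular(STATES)   # P(Y | X, Z): context (x, z), output y
idm = Tabular(ACTIONS)  # Q(Z | X, Y): context (x, y), output z

for step in range(200):
    x = random.choice(STATES)

    # Phase 1 (Variational Information Maximisation): update the FWM, with the
    # frozen IDM's log-probability of the latent action as the reward.
    z = random.choice(ACTIONS)
    ys = [fwm.sample((x, z)) for _ in range(8)]
    fwm.reinforce((x, z), ys,
                  group_advantages([idm.logprob((x, y), z) for y in ys]))

    # Phase 2 (ELBO Maximisation): update the IDM on an observed transition,
    # with the frozen FWM's log-probability of the next state as the reward.
    y = random.choice(STATES)  # stands in for a next state observed in data
    zs = [idm.sample((x, y)) for _ in range(8)]
    idm.reinforce((x, y), zs,
                  group_advantages([fwm.logprob((x, z_), y) for z_ in zs]))
```

Only the reward wiring is meant to track the abstract: each phase samples a group of candidates from one model, scores them with the frozen opposite model's log-probability, and applies a group-relative update.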
Related papers
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination [55.982504915794514]
We propose Self-Correcting VLA (SC-VLA), which achieves self-improvement by intrinsically guiding action refinement through sparse imagination. SC-VLA achieves state-of-the-art performance, yielding the highest task throughput with 16% fewer steps and a 9% higher success rate than the best-performing baselines.
arXiv Detail & Related papers (2026-02-25T06:58:06Z)
- Reinforcement World Model Learning for LLM-based Agents [60.65003139516272]
Reinforcement World Model Learning (RWML) is a self-conditioned method that learns action-supervised world models for LLM-based agents. Our method aligns simulated next states produced by the model with realized next states observed from the environment. We evaluate our method on ALFWorld and $2$ Bench and observe significant gains over the base model, despite being entirely self-supervised. (A toy sketch of this next-state alignment reward appears after the list.)
arXiv Detail & Related papers (2026-02-05T16:30:08Z)
- Internalizing World Models via Self-Play Finetuning for Agentic RL [65.96875390986655]
Large Language Models (LLMs) as agents often struggle in out-of-distribution (OOD) scenarios. We show how to encode such a world model by decomposing it into two components: state representation and transition modeling. We introduce SPA, a simple reinforcement learning framework that cold-starts the policy via a Self-Play supervised finetuning stage to learn the world model.
arXiv Detail & Related papers (2025-10-16T18:03:39Z)
- OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging [124.91183814854126]
Model merging seeks to combine multiple expert models into a single model. We introduce a benchmark for model merging research that clearly divides the tasks for MLLM training and evaluation. We find that model merging offers a promising way to build improved MLLMs without requiring training data. (A minimal weight-averaging sketch appears after the list.)
arXiv Detail & Related papers (2025-05-26T12:23:14Z)
- FLARE: Robot Learning with Implicit World Modeling [87.81846091038676]
FLARE integrates predictive latent world modeling into robot policy learning. FLARE achieves state-of-the-art performance, outperforming prior policy-learning baselines by up to 26%. Our results establish FLARE as a general and scalable approach for combining implicit world modeling with high-frequency robotic control.
arXiv Detail & Related papers (2025-05-21T15:33:27Z)
- Uncovering Untapped Potential in Sample-Efficient World Model Agents [51.65485693709418]
Simulus is a highly modular token-based world model (TBWM) agent that integrates a multi-modality tokenization framework, intrinsic motivation, prioritized WM replay, and regression-as-classification. Simulus achieves state-of-the-art sample efficiency for planning-free WMs across three diverse benchmarks.
arXiv Detail & Related papers (2025-02-17T08:06:10Z)
- Towards Coupling Full-disk and Active Region-based Flare Prediction for Operational Space Weather Forecasting [0.5872014229110215]
We present new approaches to train and deploy an operational solar flare prediction system for $\geq$M1.0-class flares.
In full-disk mode, predictions are performed on full-disk line-of-sight magnetograms using deep learning models.
In active region-based models, predictions are issued for each active region individually.
arXiv Detail & Related papers (2022-08-11T22:34:44Z)
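As referenced in the RWML entry above, the alignment between simulated and realized next states can be read as a scalar reward. The sketch below is an assumption: the paper's actual similarity measure is not given in the summary, so a token-overlap F1 on textual states stands in as a placeholder.

```python
# Placeholder alignment reward for an RWML-style objective. Assumption: the
# summary above does not specify the real measure; token-overlap F1 between
# textual states is used purely for illustration.
def alignment_reward(simulated: str, realized: str) -> float:
    """Token-overlap F1 between a simulated and a realized next state."""
    sim, real = set(simulated.split()), set(realized.split())
    common = len(sim & real)
    if common == 0:
        return 0.0
    precision, recall = common / len(sim), common / len(real)
    return 2 * precision * recall / (precision + recall)

# e.g. score a world model's prediction against what the environment returned:
print(alignment_reward("the drawer is now open", "drawer is open"))  # 0.75
```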
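And for the OptMerge entry, the simplest merging baseline for context: uniform weight averaging over same-architecture expert checkpoints (the "model soups" recipe). This is not OptMerge's own algorithm, which the summary does not describe; the snippet only illustrates what combining multiple expert models into one means at the parameter level.

```python
# Uniform weight averaging across same-architecture expert checkpoints --
# the simplest model-merging baseline, not OptMerge's specific method.
import torch

def average_merge(state_dicts):
    """Average parameter tensors key-by-key across expert models."""
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack(
            [sd[key].float() for sd in state_dicts]
        ).mean(dim=0)
    return merged

# Hypothetical usage with two fine-tuned experts sharing one architecture:
# model.load_state_dict(average_merge([expert_a.state_dict(),
#                                      expert_b.state_dict()]))
```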