FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making
- URL: http://arxiv.org/abs/2507.12496v1
- Date: Tue, 15 Jul 2025 21:49:49 GMT
- Title: FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making
- Authors: Yucen Wang, Rui Yu, Shenghua Wan, Le Gan, De-Chuan Zhan
- Abstract summary: Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. We propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent's physical states in the world simulator from external observations.
- Score: 32.050134958163184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. In this work, we propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs to enable open-ended task solving in embodied environments in a reward-free manner. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent's physical states in the world simulator from external observations. This mapping enables the learning of a goal-conditioned policy through imagination during behavior learning, with the mapped task serving as the goal state. Our method leverages the predicted temporal distance to the goal state as an informative reward signal. FOUNDER demonstrates superior performance on various multi-task offline visual control benchmarks, excelling in capturing the deep-level semantics of tasks specified by text or videos, particularly in scenarios involving complex observations or domain gaps where prior methods struggle. The consistency of our learned reward function with the ground-truth reward is also empirically validated. Our project website is https://sites.google.com/view/founder-rl.
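The abstract describes two mechanisms: a mapping that grounds a foundation-model embedding of a task into the world model's state space, and a predicted temporal distance to that mapped goal state used as the reward for goal-conditioned behavior learning in imagination. The sketch below illustrates that wiring only; the module names, dimensions, and training loop placement (`GroundingMap`, `TemporalDistanceHead`, `imagined_reward`) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class GroundingMap(nn.Module):
    """Maps a foundation-model embedding of a task (text/video) to a
    world-model latent state. Sizes are illustrative."""
    def __init__(self, fm_dim=512, wm_state_dim=230):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(fm_dim, 256), nn.ELU(),
            nn.Linear(256, wm_state_dim),
        )

    def forward(self, fm_embedding):
        return self.net(fm_embedding)          # inferred goal state in WM space


class TemporalDistanceHead(nn.Module):
    """Predicts how many latent steps separate a state from the goal;
    its negative serves as a dense, reward-free learning signal."""
    def __init__(self, wm_state_dim=230):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * wm_state_dim, 256), nn.ELU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, goal_state):
        return self.net(torch.cat([state, goal_state], dim=-1)).squeeze(-1)


def imagined_reward(distance_head, imagined_states, goal_state):
    """Reward for goal-conditioned policy learning in imagination:
    smaller predicted temporal distance to the mapped goal is better."""
    d = distance_head(imagined_states, goal_state.expand_as(imagined_states))
    return -d

# Toy usage: one task embedding, one short imagined latent rollout.
fm_task = torch.randn(1, 512)                  # e.g. a text/video task embedding
grounder, dist_head = GroundingMap(), TemporalDistanceHead()
goal = grounder(fm_task)                       # goal state in the WM latent space
rollout = torch.randn(15, 230)                 # imagined latent trajectory
rewards = imagined_reward(dist_head, rollout, goal)
print(rewards.shape)                           # torch.Size([15])
```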
Related papers
- Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach [83.21177515180564]
We propose a framework that prioritizes natural language understanding and structured reasoning to enhance the agent's global understanding of the environment. Our method outperforms previous approaches, particularly achieving a 44.4% relative improvement in task success rate.
arXiv Detail & Related papers (2025-05-22T09:08:47Z) - A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards [29.923942622540356]
We introduce Iterative Keypoint Reward (IKER), a Python-based reward function that serves as a dynamic task specification. We reconstruct real-world scenes in simulation and use the generated rewards to train reinforcement learning policies. The results highlight IKER's effectiveness in enabling robots to perform multi-step tasks in dynamic environments.
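Since IKER is described as a Python reward function acting as a task specification, a minimal sketch of what such a VLM-emitted keypoint reward could look like is shown below. The keypoint name, target, and success bonus are assumptions for illustration, not the paper's actual generated code.

```python
import numpy as np

def iker_style_reward(keypoints: dict, target: np.ndarray,
                      success_radius: float = 0.03) -> float:
    """Illustrative keypoint-based reward in the spirit of IKER: a VLM could
    emit a snippet like this for, e.g., 'move the mug handle to the target'."""
    handle = np.asarray(keypoints["mug_handle"])   # tracked 3D keypoint (assumed name)
    dist = np.linalg.norm(handle - target)
    reward = -dist                                 # dense shaping term
    if dist < success_radius:
        reward += 1.0                              # sparse success bonus (assumed)
    return float(reward)

# Toy usage with keypoints tracked in a reconstructed simulation scene.
kps = {"mug_handle": [0.42, 0.10, 0.31]}
goal = np.array([0.45, 0.12, 0.30])
print(iker_style_reward(kps, goal))
```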
arXiv Detail & Related papers (2025-02-12T18:57:22Z) - Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
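The summary describes a frozen pre-trained VLM used as a patch-wise feature extractor feeding a small policy trained by behavior cloning. A minimal sketch of that wiring follows; the toy patchifier, dimensions, and pooling are placeholders, not the Flex architecture.

```python
import torch
import torch.nn as nn

class FrozenPatchExtractor(nn.Module):
    """Stand-in for a frozen VLM vision tower returning patch-wise features;
    a real system would wrap a pre-trained backbone instead."""
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=32, stride=32)  # toy patchifier
        for p in self.parameters():
            p.requires_grad_(False)              # frozen, as described

    def forward(self, img):
        feats = self.proj(img)                   # (B, dim, H/32, W/32)
        return feats.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

class PolicyHead(nn.Module):
    """Small trainable head mapping pooled patch features to actions;
    would be trained with behavior cloning on demonstrations."""
    def __init__(self, dim=768, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                 nn.Linear(256, action_dim))

    def forward(self, patch_feats):
        return self.net(patch_feats.mean(dim=1))  # mean-pool over patches

img = torch.randn(1, 3, 224, 224)
actions = PolicyHead()(FrozenPatchExtractor()(img))
print(actions.shape)                             # torch.Size([1, 4])
```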
arXiv Detail & Related papers (2024-10-16T19:59:31Z) - TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z) - Learning Latent Dynamic Robust Representations for World Models [9.806852421730165]
Visual Model-Based Reinforcement Learning (MBRL) promises to encapsulate the agent's knowledge about the underlying dynamics of the environment.
Top MBRL agents such as Dreamer often struggle with visual pixel-based inputs in the presence of irrelevant noise in the observation space.
We apply a spatio-temporal masking strategy, combined with latent reconstruction, to capture endogenous, task-specific aspects of the environment for world models.
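A minimal sketch of the masking-plus-latent-reconstruction idea in that summary: mask patches of the observation, encode the masked view, and regress it onto the (detached) encoding of the clean view so the latent keeps task-relevant content. All module names and sizes here are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy observation encoder standing in for the world model's image encoder.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ELU(),
                        nn.Linear(256, 64))

def random_patch_mask(obs, patch=8, drop=0.5):
    """Zero out a random subset of patches (a simple spatial mask;
    a spatio-temporal variant would also mask across frames)."""
    b, c, h, w = obs.shape
    mask = (torch.rand(b, 1, h // patch, w // patch) > drop).float()
    mask = F.interpolate(mask, size=(h, w), mode="nearest")
    return obs * mask

def latent_reconstruction_loss(obs):
    """Encourage the encoding of a masked view to match the encoding of the
    clean view, pushing the latent toward endogenous, task-relevant content."""
    target = encoder(obs).detach()
    pred = encoder(random_patch_mask(obs))
    return F.mse_loss(pred, target)

obs = torch.randn(4, 3, 64, 64)                 # a batch of image observations
print(latent_reconstruction_loss(obs).item())
```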
arXiv Detail & Related papers (2024-05-10T06:28:42Z) - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes. Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes. We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
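A minimal sketch of the "video to symbolic memory" step the summary describes: task-related attributes extracted per frame are stored in a small table that an LLM-driven agent could later query symbolically instead of re-reading the video. The schema, entities, and attributes are illustrative assumptions.

```python
import sqlite3

# Build an in-memory symbolic store of per-frame, task-related attributes.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE memory (t REAL, entity TEXT, action TEXT, location TEXT)")
rows = [
    (0.5, "person_1", "walking", "hallway"),       # would come from perception models
    (2.0, "person_1", "opening_door", "hallway"),
    (3.5, "person_1", "sitting", "office"),
]
db.executemany("INSERT INTO memory VALUES (?, ?, ?, ?)", rows)

# The agent can answer "when does the person enter the office?" with a query.
answer = db.execute(
    "SELECT MIN(t) FROM memory WHERE entity = 'person_1' AND location = 'office'"
).fetchone()[0]
print(answer)                                      # 3.5
```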
arXiv Detail & Related papers (2024-01-16T14:33:09Z) - Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
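A minimal sketch of the "separate context and dynamics" idea above: one branch summarizes the visual context of a clip into a vector, and the dynamics model predicts the next latent conditioned on that context. Names and sizes are illustrative assumptions, not the ContextWM architecture.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Summarizes the appearance/context of an observation sequence into one
    vector, kept separate from the step-by-step dynamics."""
    def __init__(self, obs_dim=128, ctx_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ELU(),
                                 nn.Linear(64, ctx_dim))

    def forward(self, obs_seq):                  # (B, T, obs_dim)
        return self.net(obs_seq).mean(dim=1)     # pool over time -> (B, ctx_dim)

class ContextualDynamics(nn.Module):
    """Predicts the next latent state conditioned on the shared context, so
    the dynamics branch can stay lightweight and transferable."""
    def __init__(self, state_dim=30, action_dim=4, ctx_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + ctx_dim, 128), nn.ELU(),
            nn.Linear(128, state_dim))

    def forward(self, state, action, context):
        return self.net(torch.cat([state, action, context], dim=-1))

obs_seq = torch.randn(2, 16, 128)                # e.g. encoded in-the-wild video clips
ctx = ContextEncoder()(obs_seq)
next_state = ContextualDynamics()(torch.randn(2, 30), torch.randn(2, 4), ctx)
print(next_state.shape)                          # torch.Size([2, 30])
```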
arXiv Detail & Related papers (2023-05-29T14:29:12Z) - Continual Visual Reinforcement Learning with A Life-Long World Model [55.05017177980985]
We present a new continual learning approach for visual dynamics modeling. We first introduce the life-long world model, which learns task-specific latent dynamics. Then, we address the value estimation challenge for previous tasks with the exploratory-conservative behavior learning approach.
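A minimal sketch of the "task-specific latent dynamics" idea: a shared encoder with one lightweight dynamics head per task, so a new task adds a head rather than overwriting old dynamics. The layout is an illustrative assumption, not the paper's model.

```python
import torch
import torch.nn as nn

class LifeLongWorldModel(nn.Module):
    """Shared latent encoder plus per-task dynamics heads; adding a task grows
    the model instead of overwriting previously learned dynamics."""
    def __init__(self, obs_dim=64, latent_dim=32, act_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ELU(),
                                     nn.Linear(128, latent_dim))
        self.dynamics = nn.ModuleDict()          # task_id -> dynamics head
        self.latent_dim, self.act_dim = latent_dim, act_dim

    def add_task(self, task_id: str):
        self.dynamics[task_id] = nn.Sequential(
            nn.Linear(self.latent_dim + self.act_dim, 64), nn.ELU(),
            nn.Linear(64, self.latent_dim))

    def predict(self, task_id: str, obs, action):
        z = self.encoder(obs)
        return self.dynamics[task_id](torch.cat([z, action], dim=-1))

wm = LifeLongWorldModel()
wm.add_task("reach")                             # tasks arrive sequentially
wm.add_task("push")
z_next = wm.predict("push", torch.randn(1, 64), torch.randn(1, 4))
print(z_next.shape)                              # torch.Size([1, 32])
```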
arXiv Detail & Related papers (2023-03-12T05:08:03Z) - World Value Functions: Knowledge Representation for Learning and Planning [14.731788603429774]
We propose world value functions (WVFs), a type of goal-oriented general value function.
WVFs represent how to solve not just a given task, but any other goal-reaching task in an agent's environment.
We show that WVFs can be learned faster than regular value functions, while their ability to infer the environment's dynamics can be used to integrate learning and planning methods.
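A minimal tabular sketch of a goal-oriented general value function in the spirit of WVFs: one value table indexed by (state, goal, action), updated with ordinary Q-learning where reaching the commanded goal yields the reward, so a single table covers every goal-reaching task in the environment. The gridworld, reward, and hyperparameters are illustrative assumptions.

```python
import random
from collections import defaultdict

# World value function: Q[(state, goal)][action], a goal-oriented value table.
Q = defaultdict(lambda: [0.0] * 4)               # 4 actions: up/down/left/right
ALPHA, GAMMA, SIZE = 0.1, 0.95, 5
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(state, action):
    r, c = state
    dr, dc = MOVES[action]
    return (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))

def wvf_update(state, goal, action, next_state):
    """One Q-learning backup: the agent is rewarded for reaching whichever
    goal it is currently commanded, not a fixed task reward."""
    reward = 1.0 if next_state == goal else 0.0
    done = next_state == goal
    target = reward + (0.0 if done else GAMMA * max(Q[(next_state, goal)]))
    Q[(state, goal)][action] += ALPHA * (target - Q[(state, goal)][action])

# Train on randomly sampled goals so one table covers all goal-reaching tasks.
for _ in range(20000):
    goal = (random.randrange(SIZE), random.randrange(SIZE))
    state = (random.randrange(SIZE), random.randrange(SIZE))
    for _ in range(20):
        action = random.randrange(4)
        nxt = step(state, action)
        wvf_update(state, goal, action, nxt)
        if nxt == goal:
            break
        state = nxt

print(max(Q[((0, 0), (4, 4))]))                  # value of reaching (4,4) from (0,0)
```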
arXiv Detail & Related papers (2022-06-23T18:49:54Z) - Multitask Adaptation by Retrospective Exploration with Learned World Models [77.34726150561087]
We propose a meta-learned addressing model called RAMa that provides training samples for the MBRL agent taken from task-agnostic storage.
The model is trained to maximize the expected agent's performance by selecting promising trajectories solving prior tasks from the storage.
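A minimal sketch of the retrospective-exploration idea: score stored trajectories from prior tasks against the current task and replay the most promising ones to the model-based agent. The scoring model, embeddings, and buffer format are illustrative assumptions, not the RAMa architecture.

```python
import torch
import torch.nn as nn

class AddressingModel(nn.Module):
    """Scores stored trajectories by how useful they look for the current task;
    in the paper's setting it would be trained to maximize the agent's return."""
    def __init__(self, traj_dim=64, task_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(traj_dim + task_dim, 64), nn.ELU(),
                                 nn.Linear(64, 1))

    def forward(self, traj_embeddings, task_embedding):
        task = task_embedding.expand(traj_embeddings.size(0), -1)
        return self.net(torch.cat([traj_embeddings, task], dim=-1)).squeeze(-1)

def select_for_replay(storage, scores, k=4):
    """Pick the top-k trajectories from task-agnostic storage to feed the
    model-based RL agent's training batch."""
    idx = torch.topk(scores, k).indices
    return [storage[i] for i in idx.tolist()]

storage = [f"trajectory_{i}" for i in range(32)]         # task-agnostic buffer
traj_emb = torch.randn(32, 64)                           # embeddings of those trajectories
scores = AddressingModel()(traj_emb, torch.randn(1, 16)) # current-task embedding
print(select_for_replay(storage, scores))
```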
arXiv Detail & Related papers (2021-10-25T20:02:57Z) - Estimating Disentangled Belief about Hidden State and Hidden Task for Meta-RL [27.78147889149745]
Meta-reinforcement learning (meta-RL) algorithms enable autonomous agents to adapt to new tasks from a small amount of experience.
In meta-RL, the specification (such as the reward function) of the current task is hidden from the agent.
We propose estimating disentangled belief about task and states, leveraging an inductive bias that the task and states can be regarded as global and local features of each task.
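A minimal sketch of the disentanglement described above: two recurrent belief modules fed the same transitions, one carrying a slow "global" belief over the hidden task and one a fast "local" belief over the hidden state. Module names and dimensions are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class DisentangledBelief(nn.Module):
    """Keeps separate beliefs over the hidden task (global, slowly varying)
    and the hidden state (local, fast varying) from the same experience."""
    def __init__(self, obs_dim=16, act_dim=4, task_dim=8, state_dim=32):
        super().__init__()
        in_dim = obs_dim + act_dim + 1           # observation, action, reward
        self.task_rnn = nn.GRUCell(in_dim, task_dim)
        self.state_rnn = nn.GRUCell(in_dim, state_dim)

    def forward(self, transitions):              # (T, obs+act+reward)
        task_b = torch.zeros(1, self.task_rnn.hidden_size)
        state_b = torch.zeros(1, self.state_rnn.hidden_size)
        for x in transitions:
            x = x.unsqueeze(0)
            task_b = self.task_rnn(x, task_b)    # global feature of the episode
            state_b = self.state_rnn(x, state_b) # local feature of the timestep
        return task_b, state_b

transitions = torch.randn(10, 16 + 4 + 1)        # a short rollout of (o, a, r)
task_belief, state_belief = DisentangledBelief()(transitions)
print(task_belief.shape, state_belief.shape)     # (1, 8) (1, 32)
```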
arXiv Detail & Related papers (2021-05-14T06:11:36Z)