FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making
- URL: http://arxiv.org/abs/2507.12496v1
- Date: Tue, 15 Jul 2025 21:49:49 GMT
- Title: FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making
- Authors: Yucen Wang, Rui Yu, Shenghua Wan, Le Gan, De-Chuan Zhan
- Abstract summary: Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. We propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent's physical states in the world simulator from external observations.
- Score: 32.050134958163184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. In this work, we propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs to enable open-ended task solving in embodied environments in a reward-free manner. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent's physical states in the world simulator from external observations. This mapping enables the learning of a goal-conditioned policy through imagination during behavior learning, with the mapped task serving as the goal state. Our method leverages the predicted temporal distance to the goal state as an informative reward signal. FOUNDER demonstrates superior performance on various multi-task offline visual control benchmarks, excelling in capturing the deep-level semantics of tasks specified by text or videos, particularly in scenarios involving complex observations or domain gaps where prior methods struggle. The consistency of our learned reward function with the ground-truth reward is also empirically validated. Our project website is https://sites.google.com/view/founder-rl.
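The abstract describes two mechanisms: a mapping that grounds a foundation-model embedding of a task into the world model's state space, and a predicted temporal distance to that mapped goal state used as the reward for goal-conditioned behavior learning in imagination. The sketch below illustrates that wiring only; the module names, dimensions, and training loop placement (`GroundingMap`, `TemporalDistanceHead`, `imagined_reward`) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class GroundingMap(nn.Module):
    """Maps a foundation-model embedding of a task (text/video) to a
    world-model latent state. Sizes are illustrative."""
    def __init__(self, fm_dim=512, wm_state_dim=230):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(fm_dim, 256), nn.ELU(),
            nn.Linear(256, wm_state_dim),
        )

    def forward(self, fm_embedding):
        return self.net(fm_embedding)          # inferred goal state in WM space


class TemporalDistanceHead(nn.Module):
    """Predicts how many latent steps separate a state from the goal;
    its negative serves as a dense, reward-free learning signal."""
    def __init__(self, wm_state_dim=230):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * wm_state_dim, 256), nn.ELU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, goal_state):
        return self.net(torch.cat([state, goal_state], dim=-1)).squeeze(-1)


def imagined_reward(distance_head, imagined_states, goal_state):
    """Reward for goal-conditioned policy learning in imagination:
    smaller predicted temporal distance to the mapped goal is better."""
    d = distance_head(imagined_states, goal_state.expand_as(imagined_states))
    return -d

# Toy usage: one task embedding, one short imagined latent rollout.
fm_task = torch.randn(1, 512)                  # e.g. a text/video task embedding
grounder, dist_head = GroundingMap(), TemporalDistanceHead()
goal = grounder(fm_task)                       # goal state in the WM latent space
rollout = torch.randn(15, 230)                 # imagined latent trajectory
rewards = imagined_reward(dist_head, rollout, goal)
print(rewards.shape)                           # torch.Size([15])
```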
Related papers
- Unlocking Smarter Device Control: Foresighted Planning with a World Model-Driven Code Execution Approach [83.21177515180564]
We propose a framework that prioritizes natural language understanding and structured reasoning to enhance the agent's global understanding of the environment. Our method outperforms previous approaches, particularly achieving a 44.4% relative improvement in task success rate.
arXiv Detail & Related papers (2025-05-22T09:08:47Z) - A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards [29.923942622540356]
We introduce Iterative Keypoint Reward (IKER), a Python-based reward function that serves as a dynamic task specification. We reconstruct real-world scenes in simulation and use the generated rewards to train reinforcement learning policies. The results highlight IKER's effectiveness in enabling robots to perform multi-step tasks in dynamic environments.
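Since IKER is described as a Python reward function acting as a task specification, a minimal sketch of what such a VLM-emitted keypoint reward could look like is shown below. The keypoint name, target, and success bonus are assumptions for illustration, not the paper's actual generated code.

```python
import numpy as np

def iker_style_reward(keypoints: dict, target: np.ndarray,
                      success_radius: float = 0.03) -> float:
    """Illustrative keypoint-based reward in the spirit of IKER: a VLM could
    emit a snippet like this for, e.g., 'move the mug handle to the target'."""
    handle = np.asarray(keypoints["mug_handle"])   # tracked 3D keypoint (assumed name)
    dist = np.linalg.norm(handle - target)
    reward = -dist                                 # dense shaping term
    if dist < success_radius:
        reward += 1.0                              # sparse success bonus (assumed)
    return float(reward)

# Toy usage with keypoints tracked in a reconstructed simulation scene.
kps = {"mug_handle": [0.42, 0.10, 0.31]}
goal = np.array([0.45, 0.12, 0.30])
print(iker_style_reward(kps, goal))
```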
arXiv Detail & Related papers (2025-02-12T18:57:22Z) - Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
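The summary describes a frozen pre-trained VLM used as a patch-wise feature extractor feeding a small policy trained by behavior cloning. A minimal sketch of that wiring follows; the toy patchifier, dimensions, and pooling are placeholders, not the Flex architecture.

```python
import torch
import torch.nn as nn

class FrozenPatchExtractor(nn.Module):
    """Stand-in for a frozen VLM vision tower returning patch-wise features;
    a real system would wrap a pre-trained backbone instead."""
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=32, stride=32)  # toy patchifier
        for p in self.parameters():
            p.requires_grad_(False)              # frozen, as described

    def forward(self, img):
        feats = self.proj(img)                   # (B, dim, H/32, W/32)
        return feats.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

class PolicyHead(nn.Module):
    """Small trainable head mapping pooled patch features to actions;
    would be trained with behavior cloning on demonstrations."""
    def __init__(self, dim=768, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                 nn.Linear(256, action_dim))

    def forward(self, patch_feats):
        return self.net(patch_feats.mean(dim=1))  # mean-pool over patches

img = torch.randn(1, 3, 224, 224)
actions = PolicyHead()(FrozenPatchExtractor()(img))
print(actions.shape)                             # torch.Size([1, 4])
```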
arXiv Detail & Related papers (2024-10-16T19:59:31Z) - TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z) - Learning Latent Dynamic Robust Representations for World Models [9.806852421730165]
Visual Model-Based Reinforcement Learning (MBRL) promises to encapsulate the agent's knowledge about the underlying dynamics of the environment.
Top MBRL agents such as Dreamer often struggle with visual pixel-based inputs in the presence of irrelevant noise in the observation space.
We apply a spatio-temporal masking strategy, combined with latent reconstruction, to capture endogenous, task-specific aspects of the environment for world models.
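A minimal sketch of the masking-plus-latent-reconstruction idea in that summary: mask patches of the observation, encode the masked view, and regress it onto the (detached) encoding of the clean view so the latent keeps task-relevant content. All module names and sizes here are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy observation encoder standing in for the world model's image encoder.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ELU(),
                        nn.Linear(256, 64))

def random_patch_mask(obs, patch=8, drop=0.5):
    """Zero out a random subset of patches (a simple spatial mask;
    a spatio-temporal variant would also mask across frames)."""
    b, c, h, w = obs.shape
    mask = (torch.rand(b, 1, h // patch, w // patch) > drop).float()
    mask = F.interpolate(mask, size=(h, w), mode="nearest")
    return obs * mask

def latent_reconstruction_loss(obs):
    """Encourage the encoding of a masked view to match the encoding of the
    clean view, pushing the latent toward endogenous, task-relevant content."""
    target = encoder(obs).detach()
    pred = encoder(random_patch_mask(obs))
    return F.mse_loss(pred, target)

obs = torch.randn(4, 3, 64, 64)                 # a batch of image observations
print(latent_reconstruction_loss(obs).item())
```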
arXiv Detail & Related papers (2024-05-10T06:28:42Z) - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes. Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes. We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
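A minimal sketch of the "video to symbolic memory" step the summary describes: task-related attributes extracted per frame are stored in a small table that an LLM-driven agent could later query symbolically instead of re-reading the video. The schema, entities, and attributes are illustrative assumptions.

```python
import sqlite3

# Build an in-memory symbolic store of per-frame, task-related attributes.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE memory (t REAL, entity TEXT, action TEXT, location TEXT)")
rows = [
    (0.5, "person_1", "walking", "hallway"),       # would come from perception models
    (2.0, "person_1", "opening_door", "hallway"),
    (3.5, "person_1", "sitting", "office"),
]
db.executemany("INSERT INTO memory VALUES (?, ?, ?, ?)", rows)

# The agent can answer "when does the person enter the office?" with a query.
answer = db.execute(
    "SELECT MIN(t) FROM memory WHERE entity = 'person_1' AND location = 'office'"
).fetchone()[0]
print(answer)                                      # 3.5
```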
arXiv Detail & Related papers (2024-01-16T14:33:09Z) - Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
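A minimal sketch of the "separate context and dynamics" idea above: one branch summarizes the visual context of a clip into a vector, and the dynamics model predicts the next latent conditioned on that context. Names and sizes are illustrative assumptions, not the ContextWM architecture.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Summarizes the appearance/context of an observation sequence into one
    vector, kept separate from the step-by-step dynamics."""
    def __init__(self, obs_dim=128, ctx_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ELU(),
                                 nn.Linear(64, ctx_dim))

    def forward(self, obs_seq):                  # (B, T, obs_dim)
        return self.net(obs_seq).mean(dim=1)     # pool over time -> (B, ctx_dim)

class ContextualDynamics(nn.Module):
    """Predicts the next latent state conditioned on the shared context, so
    the dynamics branch can stay lightweight and transferable."""
    def __init__(self, state_dim=30, action_dim=4, ctx_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + ctx_dim, 128), nn.ELU(),
            nn.Linear(128, state_dim))

    def forward(self, state, action, context):
        return self.net(torch.cat([state, action, context], dim=-1))

obs_seq = torch.randn(2, 16, 128)                # e.g. encoded in-the-wild video clips
ctx = ContextEncoder()(obs_seq)
next_state = ContextualDynamics()(torch.randn(2, 30), torch.randn(2, 4), ctx)
print(next_state.shape)                          # torch.Size([2, 30])
```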
arXiv Detail & Related papers (2023-05-29T14:29:12Z) - Continual Visual Reinforcement Learning with A Life-Long World Model [55.05017177980985]
We present a new continual learning approach for visual dynamics modeling. We first introduce the life-long world model, which learns task-specific latent dynamics. Then, we address the value estimation challenge for previous tasks with the exploratory-conservative behavior learning approach.
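A minimal sketch of the "task-specific latent dynamics" idea: a shared encoder with one lightweight dynamics head per task, so a new task adds a head rather than overwriting old dynamics. The layout is an illustrative assumption, not the paper's model.

```python
import torch
import torch.nn as nn

class LifeLongWorldModel(nn.Module):
    """Shared latent encoder plus per-task dynamics heads; adding a task grows
    the model instead of overwriting previously learned dynamics."""
    def __init__(self, obs_dim=64, latent_dim=32, act_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ELU(),
                                     nn.Linear(128, latent_dim))
        self.dynamics = nn.ModuleDict()          # task_id -> dynamics head
        self.latent_dim, self.act_dim = latent_dim, act_dim

    def add_task(self, task_id: str):
        self.dynamics[task_id] = nn.Sequential(
            nn.Linear(self.latent_dim + self.act_dim, 64), nn.ELU(),
            nn.Linear(64, self.latent_dim))

    def predict(self, task_id: str, obs, action):
        z = self.encoder(obs)
        return self.dynamics[task_id](torch.cat([z, action], dim=-1))

wm = LifeLongWorldModel()
wm.add_task("reach")                             # tasks arrive sequentially
wm.add_task("push")
z_next = wm.predict("push", torch.randn(1, 64), torch.randn(1, 4))
print(z_next.shape)                              # torch.Size([1, 32])
```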
arXiv Detail & Related papers (2023-03-12T05:08:03Z) - World Value Functions: Knowledge Representation for Learning and Planning [14.731788603429774]
We propose world value functions (WVFs), a type of goal-oriented general value function.
WVFs represent how to solve not just a given task, but any other goal-reaching task in an agent's environment.
We show that WVFs can be learned faster than regular value functions, while their ability to infer the environment's dynamics can be used to integrate learning and planning methods.
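A minimal tabular sketch of a goal-oriented general value function in the spirit of WVFs: one value table indexed by (state, goal, action), updated with ordinary Q-learning where reaching the commanded goal yields the reward, so a single table covers every goal-reaching task in the environment. The gridworld, reward, and hyperparameters are illustrative assumptions.

```python
import random
from collections import defaultdict

# World value function: Q[(state, goal)][action], a goal-oriented value table.
Q = defaultdict(lambda: [0.0] * 4)               # 4 actions: up/down/left/right
ALPHA, GAMMA, SIZE = 0.1, 0.95, 5
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(state, action):
    r, c = state
    dr, dc = MOVES[action]
    return (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))

def wvf_update(state, goal, action, next_state):
    """One Q-learning backup: the agent is rewarded for reaching whichever
    goal it is currently commanded, not a fixed task reward."""
    reward = 1.0 if next_state == goal else 0.0
    done = next_state == goal
    target = reward + (0.0 if done else GAMMA * max(Q[(next_state, goal)]))
    Q[(state, goal)][action] += ALPHA * (target - Q[(state, goal)][action])

# Train on randomly sampled goals so one table covers all goal-reaching tasks.
for _ in range(20000):
    goal = (random.randrange(SIZE), random.randrange(SIZE))
    state = (random.randrange(SIZE), random.randrange(SIZE))
    for _ in range(20):
        action = random.randrange(4)
        nxt = step(state, action)
        wvf_update(state, goal, action, nxt)
        if nxt == goal:
            break
        state = nxt

print(max(Q[((0, 0), (4, 4))]))                  # value of reaching (4,4) from (0,0)
```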
arXiv Detail & Related papers (2022-06-23T18:49:54Z) - Multitask Adaptation by Retrospective Exploration with Learned World Models [77.34726150561087]
We propose a meta-learned addressing model called RAMa that provides training samples for the MBRL agent taken from task-agnostic storage.
The model is trained to maximize the expected agent's performance by selecting promising trajectories solving prior tasks from the storage.
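A minimal sketch of the retrospective-exploration idea: score stored trajectories from prior tasks against the current task and replay the most promising ones to the model-based agent. The scoring model, embeddings, and buffer format are illustrative assumptions, not the RAMa architecture.

```python
import torch
import torch.nn as nn

class AddressingModel(nn.Module):
    """Scores stored trajectories by how useful they look for the current task;
    in the paper's setting it would be trained to maximize the agent's return."""
    def __init__(self, traj_dim=64, task_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(traj_dim + task_dim, 64), nn.ELU(),
                                 nn.Linear(64, 1))

    def forward(self, traj_embeddings, task_embedding):
        task = task_embedding.expand(traj_embeddings.size(0), -1)
        return self.net(torch.cat([traj_embeddings, task], dim=-1)).squeeze(-1)

def select_for_replay(storage, scores, k=4):
    """Pick the top-k trajectories from task-agnostic storage to feed the
    model-based RL agent's training batch."""
    idx = torch.topk(scores, k).indices
    return [storage[i] for i in idx.tolist()]

storage = [f"trajectory_{i}" for i in range(32)]         # task-agnostic buffer
traj_emb = torch.randn(32, 64)                           # embeddings of those trajectories
scores = AddressingModel()(traj_emb, torch.randn(1, 16)) # current-task embedding
print(select_for_replay(storage, scores))
```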
arXiv Detail & Related papers (2021-10-25T20:02:57Z) - Estimating Disentangled Belief about Hidden State and Hidden Task for Meta-RL [27.78147889149745]
Meta-reinforcement learning (meta-RL) algorithms enable autonomous agents to adapt to new tasks from a small amount of experience.
In meta-RL, the specification (such as the reward function) of the current task is hidden from the agent.
We propose estimating disentangled belief about task and states, leveraging an inductive bias that the task and states can be regarded as global and local features of each task.
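A minimal sketch of the disentanglement described above: two recurrent belief modules fed the same transitions, one carrying a slow "global" belief over the hidden task and one a fast "local" belief over the hidden state. Module names and dimensions are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class DisentangledBelief(nn.Module):
    """Keeps separate beliefs over the hidden task (global, slowly varying)
    and the hidden state (local, fast varying) from the same experience."""
    def __init__(self, obs_dim=16, act_dim=4, task_dim=8, state_dim=32):
        super().__init__()
        in_dim = obs_dim + act_dim + 1           # observation, action, reward
        self.task_rnn = nn.GRUCell(in_dim, task_dim)
        self.state_rnn = nn.GRUCell(in_dim, state_dim)

    def forward(self, transitions):              # (T, obs+act+reward)
        task_b = torch.zeros(1, self.task_rnn.hidden_size)
        state_b = torch.zeros(1, self.state_rnn.hidden_size)
        for x in transitions:
            x = x.unsqueeze(0)
            task_b = self.task_rnn(x, task_b)    # global feature of the episode
            state_b = self.state_rnn(x, state_b) # local feature of the timestep
        return task_b, state_b

transitions = torch.randn(10, 16 + 4 + 1)        # a short rollout of (o, a, r)
task_belief, state_belief = DisentangledBelief()(transitions)
print(task_belief.shape, state_belief.shape)     # (1, 8) (1, 32)
```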
arXiv Detail & Related papers (2021-05-14T06:11:36Z)