Related papers: Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

URL: http://arxiv.org/abs/2508.05635v1
Date: Thu, 07 Aug 2025 17:59:44 GMT
Title: Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
Authors: Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, Liliang Chen, Shuicheng Yan, Maoqing Yao, Guanghui Ren,
Abstract summary: We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation.<n>GE integrates policy learning, evaluation, and simulation within a single video-generative framework.
Score: 65.30763239365928
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce Genie Envisioner (GE), a unified world foundation platform for robotic manipulation that integrates policy learning, evaluation, and simulation within a single video-generative framework. At its core, GE-Base is a large-scale, instruction-conditioned video diffusion model that captures the spatial, temporal, and semantic dynamics of real-world robotic interactions in a structured latent space. Built upon this foundation, GE-Act maps latent representations to executable action trajectories through a lightweight, flow-matching decoder, enabling precise and generalizable policy inference across diverse embodiments with minimal supervision. To support scalable evaluation and training, GE-Sim serves as an action-conditioned neural simulator, producing high-fidelity rollouts for closed-loop policy development. The platform is further equipped with EWMBench, a standardized benchmark suite measuring visual fidelity, physical consistency, and instruction-action alignment. Together, these components establish Genie Envisioner as a scalable and practical foundation for instruction-driven, general-purpose embodied intelligence. All code, models, and benchmarks will be released publicly.

Related papers

GAMMS: Graph based Adversarial Multiagent Modeling Simulator [15.681127447904322]
We present GAMMS (Graph based Adrial Multiagent Modeling Simulator), a lightweight yet scalable simulation framework.<n>GAMMS emphasizes five core objectives: scalability, ease of use, integration-first architecture, fast visualization feedback, and real-world grounding.<n>It enables efficient simulation of complex domains such as urban road networks and communication systems.
arXiv Detail & Related papers (2026-02-04T22:38:51Z)
Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems [75.78934957242403]
Self-driving vehicles and drones require true Spatial Intelligence from multi-modal onboard sensor data.<n>This paper presents a framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal.
arXiv Detail & Related papers (2025-12-30T17:58:01Z)
Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation [70.8381970762877]
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in semantic reasoning and task planning.<n>We introduce GRACE, a novel framework that grounds VLM-based reasoning through executable analytic concepts.<n>G GRACE provides a unified and interpretable interface between high-level instruction understanding and low-level robot control.
arXiv Detail & Related papers (2025-10-09T09:08:33Z)
AnywhereVLA: Language-Conditioned Exploration and Mobile Manipulation [1.8266092127796327]
AnywhereVLA is a modular framework for mobile manipulation.<n>A text prompt serves as an entry point and is parsed into a structured task graph.<n>For interaction, a compact SmolVLA manipulation head is fine tuned on platform pick and place trajectories.
arXiv Detail & Related papers (2025-09-25T11:04:44Z)
OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation [49.66156306240961]
We present a training framework for robotic foundation models that enables omni-modal goal conditioning for vision-based navigation.<n>Our approach leverages a high-capacity vision-language-action backbone and trains with three primary goal modalities.<n>We demonstrate that OmniVLA outperforms specialist baselines across modalities and offers a flexible foundation for fine-tuning to new modalities and tasks.
arXiv Detail & Related papers (2025-09-23T18:40:29Z)
UNO: Unified Self-Supervised Monocular Odometry for Platform-Agnostic Deployment [22.92093036869778]
We present UNO, a unified visual odometry framework that enables robust and pose estimation across diverse environments.<n>Our approach generalizes effectively across a wide range of real-world scenarios, including autonomous vehicles, aerial drones, mobile robots, and handheld devices.<n>We extensively evaluate our method on three major benchmark datasets.
arXiv Detail & Related papers (2025-06-08T06:30:37Z)
Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System [7.266794815157721]
We propose a hierarchical framework integrating a prompted Large Language Model (LLM) and a fine-tuned Vision Language Model (VLM)<n>The LLM decomposes tasks and constructs a global semantic map, while the VLM extracts task-specified semantic labels and 2D spatial information from aerial images to support local planning.<n>This is the first demonstration of an aerial-ground heterogeneous system integrating VLM-based perception with LLM-driven task reasoning and motion planning.
arXiv Detail & Related papers (2025-06-05T13:27:41Z)
FLARE: Robot Learning with Implicit World Modeling [87.81846091038676]
$textbfFLARE$ integrates predictive latent world modeling into robot policy learning.<n>$textbfFLARE$ achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%.<n>Our results establish $textbfFLARE$ as a general and scalable approach for combining implicit world modeling with high-frequency robotic control.
arXiv Detail & Related papers (2025-05-21T15:33:27Z)
Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.<n>Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.<n>We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
IRASim: A Fine-Grained World Model for Robot Manipulation [24.591694756757278]
We present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details.<n>We train a diffusion transformer and introduce a novel frame-level action-conditioning module within each transformer block to explicitly model and strengthen the action-frame alignment.
arXiv Detail & Related papers (2024-06-20T17:50:16Z)
iVideoGPT: Interactive VideoGPTs are Scalable World Models [70.02290687442624]
World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. This work introduces Interactive VideoGPT, a scalable autoregressive transformer framework that integrates multimodal signals--visual observations, actions, and rewards--into a sequence of tokens. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations.
arXiv Detail & Related papers (2024-05-24T05:29:12Z)
GEM: Group Enhanced Model for Learning Dynamical Control Systems [78.56159072162103]
We build effective dynamical models that are amenable to sample-based learning. We show that learning the dynamics on a Lie algebra vector space is more effective than learning a direct state transition model. This work sheds light on a connection between learning of dynamics and Lie group properties, which opens doors for new research directions.
arXiv Detail & Related papers (2021-04-07T01:08:18Z)
Deep Imitation Learning for Bimanual Robotic Manipulation [70.56142804957187]
We present a deep imitation learning framework for robotic bimanual manipulation. A core challenge is to generalize the manipulation skills to objects in different locations. We propose to (i) decompose the multi-modal dynamics into elemental movement primitives, (ii) parameterize each primitive using a recurrent graph neural network to capture interactions, and (iii) integrate a high-level planner that composes primitives sequentially and a low-level controller to combine primitive dynamics and inverse kinematics control.
arXiv Detail & Related papers (2020-10-11T01:40:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.