SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration
- URL: http://arxiv.org/abs/2512.10046v1
- Date: Wed, 10 Dec 2025 20:04:08 GMT
- Title: SimWorld-Robotics: Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration
- Authors: Yan Zhuang, Jiawei Ren, Xiaokang Ye, Jianzhi Shen, Ruixuan Zhang, Tianai Yue, Muhammad Faayez, Xuhong He, Ziqiao Ma, Lianhui Qin, Zhiting Hu, Tianmin Shu,
- Abstract summary: We present SimWorld-Robotics, a simulation platform for embodied AI in large-scale, photorealistic urban environments.<n>Built on Unreal Engine 5, SWR procedurally generates unlimited urban scenes populated with dynamic elements such as pedestrians and traffic systems.<n>We build two challenging robot benchmarks, where a robot must follow vision-language navigation instructions to reach a destination in the presence of pedestrians and traffic.
- Score: 32.271201714566885
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in foundation models have shown promising results in developing generalist robotics that can perform diverse tasks in open-ended scenarios given multimodal inputs. However, current work has been mainly focused on indoor, household scenarios. In this work, we present SimWorld-Robotics~(SWR), a simulation platform for embodied AI in large-scale, photorealistic urban environments. Built on Unreal Engine 5, SWR procedurally generates unlimited photorealistic urban scenes populated with dynamic elements such as pedestrians and traffic systems, surpassing prior urban simulations in realism, complexity, and scalability. It also supports multi-robot control and communication. With these key features, we build two challenging robot benchmarks: (1) a multimodal instruction-following task, where a robot must follow vision-language navigation instructions to reach a destination in the presence of pedestrians and traffic; and (2) a multi-agent search task, where two robots must communicate to cooperatively locate and meet each other. Unlike existing benchmarks, these two new benchmarks comprehensively evaluate a wide range of critical robot capacities in realistic scenarios, including (1) multimodal instructions grounding, (2) 3D spatial reasoning in large environments, (3) safe, long-range navigation with people and traffic, (4) multi-robot collaboration, and (5) grounded communication. Our experimental results demonstrate that state-of-the-art models, including vision-language models (VLMs), struggle with our tasks, lacking robust perception, reasoning, and planning abilities necessary for urban environments.
Related papers
- EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models [31.768426199719816]
We propose EgoActing, which requires directly grounding high-level instructions into various, precise, spatially aware humanoid actions.<n>We further instantiate this task by introducing EgoActor, a unified and scalable vision-language model (VLM) that can predict locomotion primitives.<n>We leverage broad supervision over egocentric RGB-only data from real-world demonstrations, spatial reasoning question-answering, and simulated environment demonstrations.
arXiv Detail & Related papers (2026-02-04T13:04:56Z) - The Sound of Simulation: Learning Multimodal Sim-to-Real Robot Policies with Generative Audio [138.07247714782412]
MultiGen is a framework that integrates large-scale generative models into traditional physics simulators.<n>We demonstrate effective zero-shot transfer to real-world pouring with novel containers and liquids.
arXiv Detail & Related papers (2025-07-03T17:59:58Z) - Towards Autonomous Micromobility through Scalable Urban Simulation [52.749987132021324]
Current micromobility depends mostly on human manual operation (in-person or remote control)<n>In this work, we present a scalable urban simulation solution to advance autonomous micromobility.
arXiv Detail & Related papers (2025-05-01T17:52:29Z) - GRAPPA: Generalizing and Adapting Robot Policies via Online Agentic Guidance [15.774237279917594]
We propose an agentic framework for robot self-guidance and self-improvement.<n>Our framework iteratively grounds a base robot policy to relevant objects in the environment.<n>We demonstrate that our approach can effectively guide manipulation policies to achieve significantly higher success rates.
arXiv Detail & Related papers (2024-10-09T02:00:37Z) - GRUtopia: Dream General Robots in a City at Scale [65.08318324604116]
This paper introduces project GRUtopia, the first simulated interactive 3D society designed for various robots.
GRScenes includes 100k interactive, finely annotated scenes, which can be freely combined into city-scale environments.
GRResidents is a Large Language Model (LLM) driven Non-Player Character (NPC) system that is responsible for social interaction.
arXiv Detail & Related papers (2024-07-15T17:40:46Z) - Commonsense Reasoning for Legged Robot Adaptation with Vision-Language Models [81.55156507635286]
Legged robots are physically capable of navigating a diverse variety of environments and overcoming a wide range of obstructions.
Current learning methods often struggle with generalization to the long tail of unexpected situations without heavy human supervision.
We propose a system, VLM-Predictive Control (VLM-PC), combining two key components that we find to be crucial for eliciting on-the-fly, adaptive behavior selection.
arXiv Detail & Related papers (2024-07-02T21:00:30Z) - RoboScript: Code Generation for Free-Form Manipulation Tasks across Real
and Simulation [77.41969287400977]
This paper presents textbfRobotScript, a platform for a deployable robot manipulation pipeline powered by code generation.
We also present a benchmark for a code generation benchmark for robot manipulation tasks in free-form natural language.
We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
arXiv Detail & Related papers (2024-02-22T15:12:00Z) - SAPIEN: A SimulAted Part-based Interactive ENvironment [77.4739790629284]
SAPIEN is a realistic and physics-rich simulated environment that hosts a large-scale set for articulated objects.
We evaluate state-of-the-art vision algorithms for part detection and motion attribute recognition as well as demonstrate robotic interaction tasks.
arXiv Detail & Related papers (2020-03-19T00:11:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.