Galaxea Open-World Dataset and G0 Dual-System VLA Model
- URL: http://arxiv.org/abs/2509.00576v1
- Date: Sat, 30 Aug 2025 18:04:19 GMT
- Title: Galaxea Open-World Dataset and G0 Dual-System VLA Model
- Authors: Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, Hang Zhao
- Abstract summary: We present a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered using a consistent robotic embodiment, paired with precise subtask-level language annotations. G0 is trained using a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training.
- Score: 55.756245350141675
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present Galaxea Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered using a consistent robotic embodiment, paired with precise subtask-level language annotations to facilitate both training and evaluation. Building on this dataset, we introduce G0, a dual-system framework that couples a Vision-Language Model (VLM) for multimodal planning with a Vision-Language-Action (VLA) model for fine-grained execution. G0 is trained using a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training. A comprehensive benchmark spanning tabletop manipulation, few-shot learning, and long-horizon mobile manipulation demonstrates the effectiveness of our approach. In particular, we find that the single-embodiment pre-training stage, together with the Galaxea Open-World Dataset, plays a critical role in achieving strong performance.
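To make the dual-system design concrete, below is a minimal sketch of how a VLM planner and a VLA executor might be coupled in a control loop. All names (VLMPlanner, VLAPolicy, run_episode) are hypothetical placeholders, the policy is stubbed, and the sketch illustrates the general planner/executor pattern under these assumptions rather than the authors' released code.

```python
# Hypothetical sketch of a dual-system VLA control loop: a slow VLM planner
# emits subtask-level language instructions, and a fast VLA policy turns the
# current subtask plus observations into low-level actions. All names are
# illustrative placeholders, not the G0 codebase.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class Observation:
    rgb: np.ndarray       # camera image, e.g. (H, W, 3)
    proprio: np.ndarray   # joint positions / gripper state


class VLMPlanner:
    """System 2: multimodal planner that decomposes a task into subtasks."""

    def plan(self, task: str, obs: Observation) -> List[str]:
        # A real planner would query a fine-tuned VLM; this fixed
        # decomposition is for illustration only.
        return [f"locate the object for: {task}",
                f"grasp the object for: {task}",
                f"place the object for: {task}"]


class VLAPolicy:
    """System 1: executes one subtask with fine-grained actions."""

    def act(self, subtask: str, obs: Observation) -> np.ndarray:
        # A real policy would run the VLA network; return zeros as a stub.
        return np.zeros(7)  # e.g. 6-DoF end-effector delta + gripper

    def done(self, subtask: str, obs: Observation) -> bool:
        return True  # stub termination check


def run_episode(task: str, planner: VLMPlanner, policy: VLAPolicy,
                get_obs, send_action, max_steps: int = 200) -> None:
    """Plan at the subtask level, then execute each subtask closed-loop."""
    obs = get_obs()
    for subtask in planner.plan(task, obs):
        for _ in range(max_steps):
            send_action(policy.act(subtask, obs))
            obs = get_obs()
            if policy.done(subtask, obs):
                break
```

In this pattern, the three-stage curriculum from the abstract (cross-embodiment pre-training, single-embodiment pre-training, task-specific post-training) would apply to the VLA side of the loop, progressively specializing the policy toward the deployment embodiment and tasks.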
Related papers
- Universal Pose Pretraining for Generalizable Vision-Language-Action Policies [83.39008378156647]
Existing Vision-Language-Action (VLA) models often suffer from feature collapse and low training efficiency. We propose Pose-VLA, a decoupled paradigm that separates VLA training into a pre-training phase for extracting universal 3D spatial priors. Our framework follows a two-stage pre-training pipeline, establishing fundamental spatial grounding via poses followed by motion alignment.
arXiv Detail & Related papers (2026-02-23T11:00:08Z)
- RDT2: Exploring the Scaling Limit of UMI Data Towards Zero-Shot Cross-Embodiment Generalization [31.40401674436269]
We introduce RDT2, a robotic foundation model built upon a 7B-parameter VLM to enable zero-shot deployment on novel embodiments for open-vocabulary tasks. We collected one of the largest open-source robotic datasets--over 10,000 hours of demonstrations in diverse families--using an enhanced, embodiment-agnostic Universal Manipulation Interface (UMI). Our approach employs a novel three-stage training recipe that aligns discrete linguistic knowledge with continuous control via Residual Vector Quantization (RVQ), flow-matching, and distillation for real-time inference (a general sketch of RVQ appears after this list).
arXiv Detail & Related papers (2026-02-03T09:38:23Z)
- iFlyBot-VLA Technical Report [25.330744626382977]
We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are listed as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; and (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets.
arXiv Detail & Related papers (2025-11-01T06:24:56Z)
- Latent Action Pretraining Through World Modeling [1.988007188564225]
We propose LAWM, a model-agnostic framework to pretrain imitation learning models in a self-supervised way. Our framework is designed to be effective for transferring across tasks, environments, and embodiments.
arXiv Detail & Related papers (2025-09-22T21:19:10Z)
- Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos [66.62109400603394]
We introduce Being-H0, a dexterous Vision-Language-Action model trained on large-scale human videos. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. We empirically show that Being-H0 excels in hand motion generation and instruction following, and that it scales well with model and data sizes.
arXiv Detail & Related papers (2025-07-21T13:19:09Z)
- DexVLG: Dexterous Vision-Language-Grasp Model at Scale [59.5613919093295]
There is little research on functional grasping with large models for human-like dexterous hands. We introduce DexVLG, a large Vision-Language-Grasp model for dexterous grasp pose prediction aligned with language instructions. We generate a dataset of 170 million dexterous grasp poses mapped to semantic parts across 174,000 objects in simulation, paired with detailed part-level captions.
arXiv Detail & Related papers (2025-07-03T16:05:25Z)
- Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation [62.711546725154314]
We introduce Gondola, a grounded vision-language planning model based on large language models (LLMs) for generalizable robotic manipulation. Gondola takes multi-view images and history plans to produce the next action plan with interleaved texts and segmentation masks of target objects and locations. Gondola outperforms the state-of-the-art LLM-based method across all four levels of the GemBench dataset.
arXiv Detail & Related papers (2025-06-12T20:04:31Z)
- UniVLA: Learning to Act Anywhere with Task-centric Latent Actions [32.83715417294052]
UniVLA is a new framework for learning cross-embodiment vision-language-action (VLA) policies. We derive task-centric action representations from videos with a latent action model. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments.
arXiv Detail & Related papers (2025-05-09T15:11:13Z)
- DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control [7.626715427413578]
Vision-language-action (VLA) models have shown promise for generalizable robot skills. Current VLA models often focus on scaling the vision-language model (VLM) component, while the action space representation remains a critical bottleneck. This paper introduces DexVLA, a novel framework designed to enhance the efficiency and generalization capabilities of VLAs for complex, long-horizon tasks.
arXiv Detail & Related papers (2025-02-09T11:25:56Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments yield strong performance, demonstrating that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z)
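The RDT2 entry above mentions Residual Vector Quantization (RVQ) as the mechanism for aligning discrete linguistic knowledge with continuous control. The sketch below illustrates RVQ in general: a continuous action vector is quantized by a stack of codebooks, each encoding the residual error left by the previous stage. The codebook sizes, dimensions, and function names are assumptions made for illustration; this is not RDT2's implementation.

```python
# Minimal sketch of Residual Vector Quantization (RVQ) for continuous action
# vectors: each stage quantizes the residual error left by the previous stage,
# so a few small codebooks yield a fine-grained discrete code. Illustrative
# only; codebook shapes are assumed, and this is not RDT2's actual model.
from typing import List
import numpy as np


def rvq_encode(x: np.ndarray, codebooks: List[np.ndarray]) -> List[int]:
    """Return one codebook index per stage for a single vector x."""
    residual = x.copy()
    indices = []
    for codebook in codebooks:                       # codebook: (K, D)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))                  # nearest code
        indices.append(idx)
        residual = residual - codebook[idx]          # quantize what is left
    return indices


def rvq_decode(indices: List[int], codebooks: List[np.ndarray]) -> np.ndarray:
    """Reconstruct the vector by summing the selected codes across stages."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, num_stages, codes_per_stage = 7, 4, 256     # e.g. a 7-DoF action
    # Later stages use smaller codes because residuals shrink stage by stage.
    codebooks = [rng.normal(scale=1.0 / (s + 1), size=(codes_per_stage, dim))
                 for s in range(num_stages)]
    action = rng.normal(size=dim)
    codes = rvq_encode(action, codebooks)
    recon = rvq_decode(codes, codebooks)
    print("codes:", codes, "reconstruction error:", np.linalg.norm(action - recon))
```

The appeal of this kind of scheme for VLA models is that each continuous action maps to a short sequence of discrete tokens, which a language-model backbone can predict with its ordinary vocabulary machinery while the decoder recovers a continuous command.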
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.