ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning
- URL: http://arxiv.org/abs/2602.11236v1
- Date: Wed, 11 Feb 2026 16:47:01 GMT
- Title: ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning
- Authors: Yandan Yang, Shuang Zeng, Tong Lin, Xinyuan Chang, Dekang Qi, Junjin Xiao, Haoyun Liu, Ronghan Chen, Yuzhi Chen, Dongjie Huo, Feng Xiong, Xing Wei, Zhiheng Ma, Mu Xu,
- Abstract summary: ABot-M0 is a framework that builds a systematic data curation pipeline.<n>It enables end-to-end transformation of heterogeneous raw data into unified, efficient representations.<n>ABot-M0 supports modular perception via a dual-stream mechanism.
- Score: 31.000965640377128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Building general-purpose embodied agents across diverse hardware remains a central challenge in robotics, often framed as the ''one-brain, many-forms'' paradigm. Progress is hindered by fragmented data, inconsistent representations, and misaligned training objectives. We present ABot-M0, a framework that builds a systematic data curation pipeline while jointly optimizing model architecture and training strategies, enabling end-to-end transformation of heterogeneous raw data into unified, efficient representations. From six public datasets, we clean, standardize, and balance samples to construct UniACT-dataset, a large-scale dataset with over 6 million trajectories and 9,500 hours of data, covering diverse robot morphologies and task scenarios. Unified pre-training improves knowledge transfer and generalization across platforms and tasks, supporting general-purpose embodied intelligence. To improve action prediction efficiency and stability, we propose the Action Manifold Hypothesis: effective robot actions lie not in the full high-dimensional space but on a low-dimensional, smooth manifold governed by physical laws and task constraints. Based on this, we introduce Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly. This shifts learning from denoising to projection onto feasible manifolds, improving decoding speed and policy stability. ABot-M0 supports modular perception via a dual-stream mechanism that integrates VLM semantics with geometric priors and multi-view inputs from plug-and-play 3D modules such as VGGT and Qwen-Image-Edit, enhancing spatial understanding without modifying the backbone and mitigating standard VLM limitations in 3D reasoning. Experiments show components operate independently with additive benefits. We will release all code and pipelines for reproducibility and future research.
Related papers
- Generalizable Geometric Prior and Recurrent Spiking Feature Learning for Humanoid Robot Manipulation [90.90219129619344]
This paper presents a novel R-prior-S, Recurrent Geometric-priormodal Policy with Spiking features.<n>To ground high-level reasoning in physical reality, we leverage lightweight 2D geometric inductive biases.<n>For the data efficiency issue in robotic action generation, we introduce a Recursive Adaptive Spiking Network.
arXiv Detail & Related papers (2026-01-13T23:36:30Z) - mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs [5.109732854501585]
We introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations.<n>Our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
arXiv Detail & Related papers (2025-12-17T18:47:31Z) - Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding [80.66591664266744]
Lemon is a unified transformer architecture that processes 3D point cloud patches and language tokens as a single sequence.<n>To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context.<n>Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks.
arXiv Detail & Related papers (2025-12-14T20:02:43Z) - URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model [76.08429266631823]
We propose an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM)<n>URDF-Anything utilizes an autoregressive prediction framework based on point-cloud and text multimodal input to jointly optimize geometric segmentation and kinematic parameter prediction.<n> Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing approaches.
arXiv Detail & Related papers (2025-11-02T13:45:51Z) - StaMo: Unsupervised Learning of Generalizable Robot Motion from Compact State Representation [56.996371714721995]
We propose an unsupervised approach that learns a highly compressed two-token state representation.<n>Our representation is efficient, interpretable, and integrates seamlessly into existing VLA-based models.<n>We name our method StaMo for its ability to learn generalizable robotic Motion from compact State representation.
arXiv Detail & Related papers (2025-10-06T17:37:24Z) - EMMA: Generalizing Real-World Robot Manipulation via Generative Visual Transfer [35.27100635173712]
Vision-language-action (VLA) models increasingly rely on diverse training data to achieve robust generalization.<n>We propose Embodied Manipulation Media Adaptation (EMMA), a VLA policy enhancement framework that integrates a generative data engine with an effective training pipeline.<n>DreamTransfer enables text-controlled visual editing of robot videos, transforming foreground, background, and lighting conditions without compromising 3D structure or geometrical plausibility.<n>AdaMix is a hard-sample-aware training strategy that dynamically reweights training batches to focus optimization on perceptually or kinematically challenging samples.
arXiv Detail & Related papers (2025-09-26T14:34:44Z) - MetaSSC: Enhancing 3D Semantic Scene Completion for Autonomous Driving through Meta-Learning and Long-sequence Modeling [3.139165705827712]
We introduce MetaSSC, a novel meta-learning-based framework for semantic scene completion (SSC)<n>Our approach begins with a voxel-based semantic segmentation (SS) pretraining task, aimed at exploring the semantics and geometry of incomplete regions.<n>Using simulated cooperative perception datasets, we supervise the perception training of a single vehicle using aggregated sensor data.<n>This meta-knowledge is then adapted to the target domain through a dual-phase training strategy, enabling efficient deployment.
arXiv Detail & Related papers (2024-11-06T05:11:25Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation mask generated by internet-scale foundation models.<n>Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.<n>Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z) - Deep Imitation Learning for Bimanual Robotic Manipulation [70.56142804957187]
We present a deep imitation learning framework for robotic bimanual manipulation.
A core challenge is to generalize the manipulation skills to objects in different locations.
We propose to (i) decompose the multi-modal dynamics into elemental movement primitives, (ii) parameterize each primitive using a recurrent graph neural network to capture interactions, and (iii) integrate a high-level planner that composes primitives sequentially and a low-level controller to combine primitive dynamics and inverse kinematics control.
arXiv Detail & Related papers (2020-10-11T01:40:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.