ManiGaussian++: General Robotic Bimanual Manipulation with Hierarchical Gaussian World Model
- URL: http://arxiv.org/abs/2506.19842v1
- Date: Tue, 24 Jun 2025 17:59:06 GMT
- Title: ManiGaussian++: General Robotic Bimanual Manipulation with Hierarchical Gaussian World Model
- Authors: Tengbo Yu, Guanxing Lu, Zaijia Yang, Haoyuan Deng, Season Si Chen, Jiwen Lu, Wenbo Ding, Guoqiang Hu, Yansong Tang, Ziwei Wang
- Abstract summary: We propose ManiGaussian++, an extension of the ManiGaussian framework that improves multi-task bimanual manipulation by digesting multi-body scene dynamics through a hierarchical Gaussian world model. Our method outperforms current state-of-the-art bimanual manipulation techniques by 20.2% across 10 simulated tasks and achieves a 60% average success rate on 9 challenging real-world tasks.
- Score: 52.02220087880269
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-task robotic bimanual manipulation is becoming increasingly popular as it enables sophisticated tasks that require diverse dual-arm collaboration patterns. Compared to unimanual manipulation, bimanual tasks pose additional challenges in understanding multi-body spatiotemporal dynamics. The existing ManiGaussian method pioneered encoding spatiotemporal dynamics into the visual representation via a Gaussian world model for single-arm settings, but it ignores the interaction of multiple embodiments in dual-arm systems, leading to a significant performance drop. In this paper, we propose ManiGaussian++, an extension of the ManiGaussian framework that improves multi-task bimanual manipulation by digesting multi-body scene dynamics through a hierarchical Gaussian world model. Specifically, we first generate task-oriented Gaussian Splatting from intermediate visual features, which aims to differentiate the acting and stabilizing arms for multi-body spatiotemporal dynamics modeling. We then build a hierarchical Gaussian world model with a leader-follower architecture, in which the multi-body spatiotemporal dynamics are mined into the intermediate visual representation via future scene prediction. The leader predicts the Gaussian Splatting deformation caused by motions of the stabilizing arm, from which the follower generates the physical consequences resulting from the movement of the acting arm. As a result, our method significantly outperforms current state-of-the-art bimanual manipulation techniques by 20.2% on 10 simulated tasks and achieves a 60% average success rate on 9 challenging real-world tasks. Our code is available at https://github.com/April-Yz/ManiGaussian_Bimanual.
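To make the leader-follower structure concrete, below is a minimal sketch of how such a hierarchical Gaussian world model could be wired up. This is not the authors' implementation (see the linked repository for that); the module names, feature dimensions, and MLP deformation heads are illustrative assumptions. The point it shows is the conditioning order: the leader predicts the Gaussian deformation induced by the stabilizing arm, and the follower predicts the acting arm's effect conditioned on that deformation.

```python
# Illustrative sketch only; names and shapes are assumptions, not the paper's code.
import torch
import torch.nn as nn


class DeformationHead(nn.Module):
    """Predicts a per-Gaussian xyz offset from Gaussian features and one arm's action."""

    def __init__(self, feat_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),  # xyz offset per Gaussian
        )

    def forward(self, gaussian_feat: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # gaussian_feat: (N, feat_dim), action: (action_dim,)
        act = action.unsqueeze(0).expand(gaussian_feat.shape[0], -1)
        return self.mlp(torch.cat([gaussian_feat, act], dim=-1))


class HierarchicalGaussianWorldModel(nn.Module):
    """Leader-follower dynamics over a task-oriented Gaussian scene representation."""

    def __init__(self, feat_dim: int = 64, action_dim: int = 8):
        super().__init__()
        self.leader = DeformationHead(feat_dim, action_dim)          # stabilizing arm
        self.follower = DeformationHead(feat_dim + 3, action_dim)    # acting arm, sees leader output

    def forward(self, gaussian_xyz, gaussian_feat, stabilizing_action, acting_action):
        # Leader: deformation caused by the stabilizing arm's motion.
        d_leader = self.leader(gaussian_feat, stabilizing_action)
        # Follower: consequences of the acting arm, conditioned on the leader's deformation.
        d_follower = self.follower(torch.cat([gaussian_feat, d_leader], dim=-1), acting_action)
        # Predicted future Gaussian centers; rendering them would give the future-scene
        # prediction used as the self-supervised signal for the intermediate representation.
        return gaussian_xyz + d_leader + d_follower


if __name__ == "__main__":
    model = HierarchicalGaussianWorldModel()
    xyz = torch.randn(1024, 3)      # Gaussian centers
    feat = torch.randn(1024, 64)    # intermediate visual features per Gaussian
    next_xyz = model(xyz, feat, torch.randn(8), torch.randn(8))
    print(next_xyz.shape)           # torch.Size([1024, 3])
```

Training such a model would additionally require rendering the deformed Gaussians and supervising against future observations, which this sketch omits.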
Related papers
- MinD: Unified Visual Imagination and Control via Hierarchical World Models [32.08769443927576]
Video generation models (VGMs) offer a promising pathway for unified world modeling in robotics.
Manipulate in Dream (MinD) is a hierarchical diffusion-based world-model framework that employs a dual-system design for vision-language manipulation.
MinD executes the VGM at low frequency to extract video prediction features, while leveraging a high-frequency diffusion policy for real-time interaction.
arXiv Detail & Related papers (2025-06-23T17:59:06Z)
- GENMO: A GENeralist Model for Human MOtion [64.16188966024542]
We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework.
Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals.
Our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control.
arXiv Detail & Related papers (2025-05-02T17:59:55Z)
- Learning Coordinated Bimanual Manipulation Policies using State Diffusion and Inverse Dynamics Models [22.826115023573205]
We infuse the predictive nature of human manipulation strategies into robot imitation learning.
We train a diffusion model to predict future states and compute robot actions that achieve the predicted states.
Our framework consistently outperforms state-of-the-art state-to-action mapping policies.
arXiv Detail & Related papers (2025-03-30T01:25:35Z)
- ManiGaussian: Dynamic Gaussian Splatting for Multi-task Robotic Manipulation [58.615616224739654]
Conventional robotic manipulation methods usually learn a semantic representation of the observation for prediction.
We propose a dynamic Gaussian Splatting method named ManiGaussian for multi-task robotic manipulation.
Our framework can outperform the state-of-the-art methods by 13.1% in average success rate.
arXiv Detail & Related papers (2024-03-13T08:06:41Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- Learning to Shift Attention for Motion Generation [55.61994201686024]
One challenge of motion generation with robot learning-from-demonstration techniques is that human demonstrations for a single task query follow a distribution with multiple modes.
Previous approaches fail to capture all modes, or tend to average the modes of the demonstrations and thus generate invalid trajectories.
We propose a motion generation model with extrapolation ability to overcome this problem.
arXiv Detail & Related papers (2021-02-24T09:07:52Z)
- Deep Imitation Learning for Bimanual Robotic Manipulation [70.56142804957187]
We present a deep imitation learning framework for robotic bimanual manipulation.
A core challenge is to generalize the manipulation skills to objects in different locations.
We propose to (i) decompose the multi-modal dynamics into elemental movement primitives, (ii) parameterize each primitive using a recurrent graph neural network to capture interactions, and (iii) integrate a high-level planner that composes primitives sequentially with a low-level controller that combines primitive dynamics and inverse kinematics control; a simplified sketch of this composition follows the citation below.
arXiv Detail & Related papers (2020-10-11T01:40:03Z)
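As a rough illustration of the planner-plus-controller decomposition described in the Deep Imitation Learning entry above, the sketch below composes a few hand-written movement primitives and executes them with a trivial proportional controller. It is not the paper's code: the class and function names are hypothetical, and the learned recurrent-graph-network primitive dynamics and inverse-kinematics control are replaced by a toy position update.

```python
# Hedged sketch; hypothetical names, toy dynamics standing in for the learned components.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Primitive:
    """An elemental movement primitive: move one end-effector toward a target position."""
    arm: str            # "left" or "right"
    target: np.ndarray  # desired end-effector position (3,)


def plan(object_pos: np.ndarray) -> List[Primitive]:
    """High-level planner: compose primitives sequentially for a toy bimanual task."""
    return [
        Primitive("left", object_pos + np.array([0.0, -0.1, 0.0])),   # stabilize from one side
        Primitive("right", object_pos + np.array([0.0, 0.1, 0.0])),   # grasp from the other
        Primitive("right", object_pos + np.array([0.0, 0.1, 0.2])),   # lift
    ]


def control(state: dict, primitive: Primitive, gain: float = 0.2, steps: int = 50) -> dict:
    """Low-level controller: simple proportional steps standing in for the learned
    primitive dynamics plus inverse-kinematics control."""
    for _ in range(steps):
        error = primitive.target - state[primitive.arm]
        if np.linalg.norm(error) < 1e-3:
            break
        state[primitive.arm] = state[primitive.arm] + gain * error
    return state


if __name__ == "__main__":
    state = {"left": np.zeros(3), "right": np.zeros(3)}
    for prim in plan(np.array([0.5, 0.0, 0.1])):
        state = control(state, prim)
    print(state)
```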
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.