H$^3$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning
- URL: http://arxiv.org/abs/2505.07819v2
- Date: Tue, 17 Jun 2025 08:36:51 GMT
- Title: H$^3$DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning
- Authors: Yiyang Lu, Yufeng Tian, Zhecheng Yuan, Xianbang Wang, Pu Hua, Zhengrong Xue, Huazhe Xu
- Abstract summary: We introduce $\textbf{Triply-Hierarchical Diffusion Policy}~(\textbf{H$^{\mathbf{3}}$DP})$, a novel visuomotor learning framework that explicitly incorporates hierarchical structures to strengthen the integration between visual features and action generation. Extensive experiments demonstrate that H$^{3}$DP yields a $\mathbf{+27.5\%}$ average relative improvement over baselines across $\mathbf{44}$ simulation tasks and achieves superior performance in $\mathbf{4}$ challenging bimanual real-world manipulation tasks.
- Score: 25.65324419553667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visuomotor policy learning has witnessed substantial progress in robotic manipulation, with recent approaches predominantly relying on generative models to model the action distribution. However, these methods often overlook the critical coupling between visual perception and action prediction. In this work, we introduce $\textbf{Triply-Hierarchical Diffusion Policy}~(\textbf{H$^{\mathbf{3}}$DP})$, a novel visuomotor learning framework that explicitly incorporates hierarchical structures to strengthen the integration between visual features and action generation. H$^{3}$DP contains $\mathbf{3}$ levels of hierarchy: (1) depth-aware input layering that organizes RGB-D observations based on depth information; (2) multi-scale visual representations that encode semantic features at varying levels of granularity; and (3) a hierarchically conditioned diffusion process that aligns the generation of coarse-to-fine actions with corresponding visual features. Extensive experiments demonstrate that H$^{3}$DP yields a $\mathbf{+27.5\%}$ average relative improvement over baselines across $\mathbf{44}$ simulation tasks and achieves superior performance in $\mathbf{4}$ challenging bimanual real-world manipulation tasks. Project Page: https://lyy-iiis.github.io/h3dp/.
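To make the three levels concrete, here is a minimal, illustrative PyTorch sketch of how they could fit together. It is a sketch only, not the authors' implementation: the layer count, per-image depth bins, network shapes, and the equal split of denoising steps across scales are all assumptions not taken from the paper.

```python
# Illustrative sketch of the three hierarchy levels named in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

def depth_layering(rgb, depth, num_layers=3):
    """Level 1: split an RGB-D frame into depth bins (near -> far).
    rgb: (B, 3, H, W); depth: (B, 1, H, W). Returns (B, num_layers, 3, H, W),
    where each layer keeps only pixels whose depth falls in its bin."""
    d_min = depth.amin(dim=(2, 3), keepdim=True)
    d_max = depth.amax(dim=(2, 3), keepdim=True)
    edges = torch.linspace(0, 1, num_layers + 1, device=depth.device)
    layers = []
    for i in range(num_layers):
        lo = d_min + edges[i] * (d_max - d_min)
        hi = d_min + edges[i + 1] * (d_max - d_min)
        if i == num_layers - 1:
            hi = hi + 1e-6  # keep the farthest pixel in the last bin
        mask = ((depth >= lo) & (depth < hi)).float()
        layers.append(rgb * mask)  # zero out pixels outside this bin
    return torch.stack(layers, dim=1)

class MultiScaleEncoder(nn.Module):
    """Level 2: encode the layered input into coarse-to-fine feature vectors."""
    def __init__(self, num_layers=3, dims=(64, 128, 256)):
        super().__init__()
        self.backbone = nn.Sequential(  # stand-in CNN backbone
            nn.Conv2d(num_layers * 3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.pools = [1, 4, 8]  # global average = coarsest; 8x8 grid = finest
        self.heads = nn.ModuleList(nn.LazyLinear(d) for d in dims)

    def forward(self, layered):                      # (B, L, 3, H, W)
        fmap = self.backbone(layered.flatten(1, 2))  # (B, 64, h, w)
        return [head(F.adaptive_avg_pool2d(fmap, p).flatten(1))
                for p, head in zip(self.pools, self.heads)]  # coarse -> fine

def make_denoisers(act_dim, dims=(64, 128, 256)):
    # One small conditional noise-prediction MLP per feature scale.
    return nn.ModuleList(
        nn.Sequential(nn.Linear(act_dim + d + 1, 256), nn.ReLU(),
                      nn.Linear(256, act_dim)) for d in dims)

def hierarchical_denoise(denoisers, noisy_action, t, feats, T=100):
    """Level 3: coarse-to-fine conditioning of the diffusion process.
    High-noise (early reverse) steps see only coarse features; low-noise
    (late) steps see fine ones."""
    phase = min(int((1.0 - t / T) * len(feats)), len(feats) - 1)
    t_feat = torch.full_like(noisy_action[:, :1], t / T)  # timestep feature
    inp = torch.cat([noisy_action, feats[phase], t_feat], dim=-1)
    return denoisers[phase](inp)  # predicted noise at this step
```

The pairing in `hierarchical_denoise` is the point the abstract emphasizes: coarse stages of action generation are conditioned on coarse visual features, and fine stages on fine ones.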
Related papers
- FLARE: Robot Learning with Implicit World Modeling [87.81846091038676]
$\textbf{FLARE}$ integrates predictive latent world modeling into robot policy learning. $\textbf{FLARE}$ achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%. Our results establish $\textbf{FLARE}$ as a general and scalable approach for combining implicit world modeling with high-frequency robotic control (a generic sketch follows this entry).
arXiv Detail & Related papers (2025-05-21T15:33:27Z)
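A generic sketch, not FLARE's actual architecture: the summary describes pairing policy learning with an implicit (latent) world-modeling objective, which could look roughly like the following. Network shapes, the loss weight `beta`, and the stop-gradient latent target are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentWorldModelPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, z_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, z_dim))
        self.policy = nn.Linear(z_dim, act_dim)            # action head
        self.dynamics = nn.Linear(z_dim + act_dim, z_dim)  # latent predictor

    def loss(self, obs, act, next_obs, beta=0.5):
        z = self.encoder(obs)
        bc = F.mse_loss(self.policy(z), act)       # behavior cloning term
        z_next_pred = self.dynamics(torch.cat([z, act], dim=-1))
        with torch.no_grad():
            z_next = self.encoder(next_obs)        # target latent, no grad
        wm = F.mse_loss(z_next_pred, z_next)       # implicit world-model term
        return bc + beta * wm
```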
- Simple Semi-supervised Knowledge Distillation from Vision-Language Models via $\mathbf{\texttt{D}}$ual-$\mathbf{\texttt{H}}$ead $\mathbf{\texttt{O}}$ptimization [49.2338910653152]
Vision-language models (VLMs) have achieved remarkable success across diverse tasks by leveraging rich textual information with minimal labeled data. Knowledge distillation (KD) offers a well-established remedy; however, recent KD approaches from VLMs often involve multi-stage training or additional tuning. We propose $\mathbf{\texttt{DHO}}$, a simple yet effective KD framework that transfers knowledge from VLMs to compact, task-specific models in semi-supervised settings (a dual-head sketch follows this entry).
arXiv Detail & Related papers (2025-05-12T15:39:51Z)
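A hedged sketch of dual-head optimization as the summary describes it: a shared backbone with one head fit to the few available labels and one head matched to the VLM teacher's soft predictions, trained in a single stage. The temperature, weighting, and head shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHeadStudent(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.sup_head = nn.Linear(256, num_classes)  # fits ground-truth labels
        self.kd_head = nn.Linear(256, num_classes)   # fits teacher predictions

def dho_loss(student, x_labeled, y, x_unlabeled, teacher_logits, tau=2.0):
    ce = F.cross_entropy(student.sup_head(student.backbone(x_labeled)), y)
    log_p = F.log_softmax(student.kd_head(student.backbone(x_unlabeled)) / tau,
                          dim=-1)
    q = F.softmax(teacher_logits / tau, dim=-1)  # teacher soft targets
    kd = F.kl_div(log_p, q, reduction="batchmean") * tau * tau
    return ce + kd  # single-stage: both heads train together
```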
- Occam's LGS: An Efficient Approach for Language Gaussian Splatting [57.00354758206751]
We show that the complicated pipelines for language 3D Gaussian Splatting are simply unnecessary. We apply Occam's razor to the task at hand, leading to a highly efficient weighted multi-view feature aggregation technique (sketched after this entry).
arXiv Detail & Related papers (2024-12-02T18:50:37Z)
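A sketch of weighted multi-view feature aggregation: each 3D point (e.g., a Gaussian) receives the weighted average of the 2D language features from the views that observe it. Using rendering/visibility weights as the weighting is an assumption consistent with the summary above.

```python
import torch

def aggregate_multiview(features, weights):
    """features: (V, N, D) per-view, per-point 2D features.
    weights:  (V, N) per-view contribution (e.g., splatting weight; zero when
    a point is invisible in that view). Returns (N, D) aggregated features."""
    w = weights.unsqueeze(-1)                                  # (V, N, 1)
    return (w * features).sum(dim=0) / w.sum(dim=0).clamp_min(1e-8)
```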
- Self-Ensembling Gaussian Splatting for Few-Shot Novel View Synthesis [55.561961365113554]
3D Gaussian Splatting (3DGS) has demonstrated remarkable effectiveness in novel view synthesis (NVS). In this paper, we introduce Self-Ensembling Gaussian Splatting (SE-GS). We achieve self-ensembling by incorporating an uncertainty-aware perturbation strategy during training. Experimental results on the LLFF, Mip-NeRF360, DTU, and MVImgNet datasets demonstrate that our approach enhances NVS quality under few-shot training conditions.
arXiv Detail & Related papers (2024-10-31T18:43:48Z)
- M$^{3}$3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding [5.989397492717352]
We present M$^{3}$3D ($\underline{M}$ulti-$\underline{M}$odal $\underline{M}$asked $\underline{3D}$), built on multi-modal masked autoencoders.
We integrate two major self-supervised learning frameworks: Masked Image Modeling (MIM) and contrastive learning (a combined-loss sketch follows this entry).
Experiments show that M$^{3}$3D outperforms existing state-of-the-art approaches on ScanNet, NYUv2, UCF-101 and OR-AR.
arXiv Detail & Related papers (2023-09-26T23:52:09Z)
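A sketch of combining the two objectives the summary names: masked image modeling (MAE-style reconstruction of masked patches) plus a contrastive InfoNCE term across modalities. The RGB/depth pairing, temperature, and the 0.5 weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def mim_loss(pred_patches, target_patches, mask):
    # Reconstruction error computed only on masked patches. Shapes: preds and
    # targets (B, N, P); mask (B, N) with 1 where a patch was masked.
    per_patch = F.mse_loss(pred_patches, target_patches,
                           reduction="none").mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum().clamp_min(1)

def info_nce(z_a, z_b, tau=0.07):
    # Contrastive alignment of two modality embeddings, shapes (B, D).
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau
    labels = torch.arange(z_a.shape[0], device=z_a.device)
    return F.cross_entropy(logits, labels)

def combined_ssl_loss(pred, target, mask, z_rgb, z_depth, lam=0.5):
    return mim_loss(pred, target, mask) + lam * info_nce(z_rgb, z_depth)
```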
- HDFormer: High-order Directed Transformer for 3D Human Pose Estimation [20.386530242069338]
HDFormer significantly outperforms state-of-the-art (SOTA) models on the Human3.6M and MPI-INF-3DHP datasets.
HDFormer demonstrates broad real-world applicability, enabling real-time, accurate 3D pose estimation.
arXiv Detail & Related papers (2023-02-03T16:00:48Z)
- Multi-Task Imitation Learning for Linear Dynamical Systems [50.124394757116605]
We study representation learning for efficient imitation learning over linear systems.
We find that the imitation gap over trajectories generated by the learned target policy is bounded by $\tilde{O}\left(\frac{k\, n_x H}{N_{\mathrm{shared}}} + \frac{k\, n_u}{N_{\mathrm{target}}}\right)$ (restated with a symbol glossary after this entry).
arXiv Detail & Related papers (2022-12-01T00:14:35Z)
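A hedged restatement of the bound above; the symbol meanings follow standard linear-system notation and are inferred, not stated in the summary.

```latex
% Imitation gap bound from the summary above, restated.
\[
  \tilde{O}\!\left(
    \frac{k\, n_x\, H}{N_{\mathrm{shared}}}
    \;+\;
    \frac{k\, n_u}{N_{\mathrm{target}}}
  \right)
\]
% n_x, n_u: state and input dimensions of the linear system (inferred);
% H: rollout horizon; k: dimension of the shared representation;
% N_shared, N_target: amounts of shared-task and target-task data.
```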
- D3Former: Debiased Dual Distilled Transformer for Incremental Learning [25.65032941918354]
In the class incremental learning (CIL) setting, groups of classes are introduced to a model in each learning phase.
The goal is to learn a unified model performant on all the classes observed so far.
We develop a Debiased Dual Distilled Transformer for CIL, dubbed $\textrm{D}^3\textrm{Former}$.
arXiv Detail & Related papers (2022-07-25T08:54:52Z)
- Improving Robustness and Generality of NLP Models Using Disentangled Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
arXiv Detail & Related papers (2020-09-21T02:48:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.