HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers
- URL: http://arxiv.org/abs/2410.05273v2
- Date: Mon, 21 Oct 2024 06:50:05 GMT
- Title: HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers
- Authors: Jianke Zhang, Yanjiang Guo, Xiaoyu Chen, Yen-Jen Wang, Yucheng Hu, Chengming Shi, Jianyu Chen
- Abstract summary: Large Vision-Language-Action (VLA) models have shown promise in robotic control due to their impressive generalization ability.
Their reliance on VLM backends with billions of parameters leads to high computational costs and inference latency.
This paper proposes HiRT, a Hierarchical Robot Transformer framework that enables flexible frequency and performance trade-off.
- Score: 12.373320641721344
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language-Action (VLA) models, leveraging powerful pre-trained Vision-Language Model (VLM) backends, have shown promise in robotic control due to their impressive generalization ability. However, the success comes at a cost. Their reliance on VLM backends with billions of parameters leads to high computational costs and inference latency, limiting the testing scenarios to mainly quasi-static tasks and hindering performance in dynamic tasks requiring rapid interactions. To address these limitations, this paper proposes HiRT, a Hierarchical Robot Transformer framework that enables flexible frequency and performance trade-off. HiRT keeps VLMs running at low frequencies to capture temporally invariant features while enabling real-time interaction through a high-frequency vision-based policy guided by the slowly updated features. Experimental results in both simulation and real-world settings demonstrate significant improvements over baseline methods. Empirically, in static tasks, we double the control frequency and achieve comparable success rates. Additionally, on novel real-world dynamic manipulation tasks which are challenging for previous VLA models, HiRT improves the success rate from 48% to 75%.
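The two-rate idea in the abstract (a slow VLM whose features are refreshed occasionally, and a fast policy that acts at every step using the latest features) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `run_hierarchical_loop`, `slow_encoder`, `fast_policy`, and `slow_every` are hypothetical, the toy functions stand in for real networks, and the loop is synchronous for simplicity, whereas a real system would likely run the slow model in parallel with the fast one.

```python
from typing import Callable, List, Optional, Sequence


def run_hierarchical_loop(
    observations: Sequence[float],
    slow_encoder: Callable[[float], float],
    fast_policy: Callable[[float, float], float],
    slow_every: int,
) -> List[float]:
    """Act at every step, refreshing the slow latent only every `slow_every` steps.

    Hypothetical illustration: `slow_encoder` stands in for an expensive VLM,
    `fast_policy` for a lightweight vision-based policy conditioned on its output.
    """
    if slow_every < 1:
        raise ValueError("slow_every must be >= 1")

    actions: List[float] = []
    latent: Optional[float] = None
    for step, obs in enumerate(observations):
        # Low-frequency branch: recompute the slowly changing features.
        if step % slow_every == 0:
            latent = slow_encoder(obs)
        # High-frequency branch: use the current observation plus the cached latent.
        actions.append(fast_policy(obs, latent))
    return actions


# Toy stand-ins (assumptions for illustration only, not real models).
calls = {"slow": 0}


def toy_slow_encoder(obs: float) -> float:
    calls["slow"] += 1
    return obs * 10.0


def toy_fast_policy(obs: float, latent: float) -> float:
    return latent + obs


actions = run_hierarchical_loop(
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    toy_slow_encoder,
    toy_fast_policy,
    slow_every=3,
)
print(actions)        # one action per observation
print(calls["slow"])  # number of expensive encoder calls
```

With `slow_every=3`, the expensive encoder runs on only two of the six steps while an action is still produced at every step, which is the frequency and performance trade-off the abstract describes.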
Related papers
- Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can in-context zero-shot and few-shot predict effective values for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z)
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z)
- One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation [80.71541671907426]
One-Step Diffusion Policy (OneDP) is a novel approach that distills knowledge from pre-trained diffusion policies into a single-step action generator.
OneDP significantly accelerates response times for robotic control tasks.
arXiv Detail & Related papers (2024-10-28T17:54:31Z)
- A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM [0.26334346517416873]
Vision-Language-Action (VLA) models enable robots to perform complex tasks by integrating visual context with linguistic commands.
To overcome this, we propose Dual Process VLA (DP-VLA), a hierarchical framework inspired by dual-process theory.
Experimental results on the RoboCasa dataset demonstrate that DP-VLA achieves faster inference and higher task success rates.
arXiv Detail & Related papers (2024-10-21T00:36:02Z)
- Robotic Control via Embodied Chain-of-Thought Reasoning [86.6680905262442]
A key limitation of learned robot control policies is their inability to generalize outside their training data.
Recent works on vision-language-action models (VLAs) have shown that the use of large, internet pre-trained vision-language models can substantially improve their robustness and generalization ability.
We introduce Embodied Chain-of-Thought Reasoning (ECoT) for VLAs, in which we train VLAs to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features before predicting the robot action.
arXiv Detail & Related papers (2024-07-11T17:31:01Z)
- OpenVLA: An Open-Source Vision-Language-Action Model [131.74098076670103]
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate.
We release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-06-13T15:46:55Z)
- LGR2: Language Guided Reward Relabeling for Accelerating Hierarchical Reinforcement Learning [22.99690700210957]
We propose a novel HRL framework that leverages language instructions to generate a stationary reward function for a higher-level policy.
Since the language-guided reward is unaffected by the lower primitive behaviour, LGR2 mitigates non-stationarity.
Our approach attains success rates exceeding 70% in challenging, sparse-reward robotic navigation and manipulation environments.
arXiv Detail & Related papers (2024-06-09T18:40:24Z)
- Learning Low-Frequency Motion Control for Robust and Dynamic Robot Locomotion [10.838285018473725]
We demonstrate robust and dynamic locomotion with a learned motion controller executing at as low as 8 Hz on a real ANYmal C quadruped.
The robot is able to robustly and repeatably achieve a high heading velocity of 1.5 m/s, traverse uneven terrain, and resist unexpected external perturbations.
arXiv Detail & Related papers (2022-09-29T15:55:33Z)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion (e.g., merged attention vs. co-
arXiv Detail & Related papers (2021-11-03T17:55:36Z)