A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM
- URL: http://arxiv.org/abs/2410.15549v1
- Date: Mon, 21 Oct 2024 00:36:02 GMT
- Title: A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM
- Authors: ByungOk Han, Jaehong Kim, Jinhyeok Jang,
- Abstract summary: Vision-Language-Action (VLA) models enable robots to perform complex tasks by integrating visual context with linguistic commands.
To overcome the high computational demands of existing models, we propose Dual Process VLA (DP-VLA), a hierarchical framework inspired by dual-process theory.
Experimental results on the RoboCasa dataset demonstrate that DP-VLA achieves faster inference and higher task success rates.
- Score: 0.26334346517416873
- Abstract: Vision-Language-Action (VLA) models are receiving increasing attention for their ability to enable robots to perform complex tasks by integrating visual context with linguistic commands. However, achieving efficient real-time performance remains challenging due to the high computational demands of existing models. To overcome this, we propose Dual Process VLA (DP-VLA), a hierarchical framework inspired by dual-process theory. DP-VLA utilizes a Large System 2 Model (L-Sys2) for complex reasoning and decision-making, while a Small System 1 Model (S-Sys1) handles real-time motor control and sensory processing. By leveraging Vision-Language Models (VLMs), the L-Sys2 operates at low frequencies, reducing computational overhead, while the S-Sys1 ensures fast and accurate task execution. Experimental results on the RoboCasa dataset demonstrate that DP-VLA achieves faster inference and higher task success rates, providing a scalable solution for advanced robotic applications.
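The two-rate idea in the abstract, a slow reasoning model paired with a fast control model, can be illustrated with a toy control loop. This is only a sketch under stated assumptions: the abstract does not specify update frequencies, interfaces, or model internals, so `DualProcessController`, `slow_planner`, `fast_policy`, `replan_every`, and `run_demo` are hypothetical names, and the lambdas below are trivial stand-ins for L-Sys2 and S-Sys1 rather than real models.

```python
class DualProcessController:
    """Toy two-rate controller (hypothetical, not the DP-VLA implementation).

    slow_planner stands in for L-Sys2 and is called only every
    `replan_every` steps; fast_policy stands in for S-Sys1 and is
    called on every step with the most recent cached subgoal.
    """

    def __init__(self, slow_planner, fast_policy, replan_every=10):
        if replan_every < 1:
            raise ValueError("replan_every must be at least 1")
        self.slow_planner = slow_planner
        self.fast_policy = fast_policy
        self.replan_every = replan_every
        self.subgoal = None
        self.slow_calls = 0  # counts how often the expensive model ran

    def step(self, t, instruction, observation):
        # Low-frequency path: refresh the subgoal only on selected steps.
        if t % self.replan_every == 0:
            self.subgoal = self.slow_planner(instruction, observation)
            self.slow_calls += 1
        # High-frequency path: act on every step using the cached subgoal.
        return self.fast_policy(observation, self.subgoal)


def run_demo(num_steps=30, replan_every=10):
    """Drive a 1-D state toward a fixed target with stand-in models."""
    controller = DualProcessController(
        # Stand-in planner: always requests the target position 1.0.
        slow_planner=lambda instruction, obs: 1.0,
        # Stand-in policy: proportional step toward the subgoal.
        fast_policy=lambda obs, subgoal: 0.5 * (subgoal - obs),
        replan_every=replan_every,
    )
    state = 0.0
    for t in range(num_steps):
        action = controller.step(t, "move to target", state)
        state += action
    return state, controller.slow_calls


if __name__ == "__main__":
    final_state, slow_calls = run_demo()
    print(f"final state: {final_state:.6f}, slow planner calls: {slow_calls}")
```

In this sketch the expensive planner runs on only a fraction of the steps while the cheap policy runs on all of them, which is the kind of saving the abstract attributes to operating L-Sys2 at low frequency. The actual ratio and the form of the information passed between the two models would have to come from the paper itself.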
Related papers
- ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model [21.844214660424175]
ChatVLA is a novel framework featuring Phased Alignment Training, which incrementally integrates multimodal data after initial control mastery, and a Mixture-of-Experts architecture to minimize task interference.
ChatVLA demonstrates competitive performance on visual question-answering datasets and significantly surpasses state-of-the-art vision-language-action (VLA) methods on multimodal understanding benchmarks.
Our findings highlight the potential of our unified framework for achieving both robust multimodal understanding and effective robot control.
arXiv Detail & Related papers (2025-02-20T10:16:18Z)
- 3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning [2.6670748466660523]
Vision-language models (VLMs) have achieved remarkable success in scene understanding and perception tasks.
VLMs lack robust 3D scene localization capabilities, limiting their effectiveness in fine-grained robotic operations.
We propose a novel framework that integrates a 2D prompt synthesis module by mapping 2D images to point clouds, and incorporates a small language model (SLM) for supervising VLM outputs.
arXiv Detail & Related papers (2025-02-13T02:40:19Z)
- DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control [7.626715427413578]
Vision-language-action (VLA) models have shown promise for generalizable robot skills.
Current VLA models often focus on scaling the vision-language model (VLM) component, while the action space representation remains a critical bottleneck.
This paper introduces DexVLA, a novel framework designed to enhance the efficiency and generalization capabilities of VLAs for complex, long-horizon tasks.
arXiv Detail & Related papers (2025-02-09T11:25:56Z)
- HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation [54.03004125910057]
We show that hierarchical vision-language-action models can be more effective in utilizing off-domain data than standard monolithic VLA models.
We show that, with the hierarchical design, the high-level VLM can transfer across significant domain gaps between the off-domain finetuning data and real-robot testing scenarios.
arXiv Detail & Related papers (2025-02-08T07:50:22Z)
- TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies [95.30717188630432]
We introduce visual trace prompting to facilitate VLA models' spatial-temporal awareness for action prediction.
We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories.
We present a compact VLA model based on 4B Phi-3-Vision, pretrained on the Open-X-Embodiment and finetuned on our dataset.
arXiv Detail & Related papers (2024-12-13T18:40:51Z)
- CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation [100.25567121604382]
Vision-Language-Action (VLA) models have improved robotic manipulation in terms of language-guided task execution and generalization to unseen scenarios.
We present a new advanced VLA architecture derived from Vision-Language-Models (VLM).
We show that our model not only significantly surpasses existing VLAs in task performance but also exhibits remarkable adaptation to new robots and generalization to unseen objects and backgrounds.
arXiv Detail & Related papers (2024-11-29T12:06:03Z)
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z)
- TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation [32.406783380729024]
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes.
Current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data.
We introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models.
arXiv Detail & Related papers (2024-09-19T07:10:18Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.
We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- OpenVLA: An Open-Source Vision-Language-Action Model [131.74098076670103]
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate.
We release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-06-13T15:46:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.