ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation
- URL: http://arxiv.org/abs/2505.22159v3
- Date: Thu, 18 Sep 2025 15:02:38 GMT
- Title: ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation
- Authors: Jiawen Yu, Hairuo Liu, Qiaojun Yu, Jieji Ren, Ce Hao, Haitong Ding, Guangyu Huang, Guofan Huang, Yan Song, Panpan Cai, Cewu Lu, Wenqiang Zhang,
- Abstract summary: ForceVLA is a novel end-to-end manipulation framework.<n>It treats external force sensing as a first-class modality within VLA systems.
- Score: 62.58034332427291
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action (VLA) models have advanced general-purpose robotic manipulation by leveraging pretrained visual and linguistic representations. However, they struggle with contact-rich tasks that require fine-grained control involving force, especially under visual occlusion or dynamic uncertainty. To address these limitations, we propose ForceVLA, a novel end-to-end manipulation framework that treats external force sensing as a first-class modality within VLA systems. ForceVLA introduces FVLMoE, a force-aware Mixture-of-Experts fusion module that dynamically integrates pretrained visual-language embeddings with real-time 6-axis force feedback during action decoding. This enables context-aware routing across modality-specific experts, enhancing the robot's ability to adapt to subtle contact dynamics. We also introduce \textbf{ForceVLA-Data}, a new dataset comprising synchronized vision, proprioception, and force-torque signals across five contact-rich manipulation tasks. ForceVLA improves average task success by 23.2% over strong pi_0-based baselines, achieving up to 80% success in tasks such as plug insertion. Our approach highlights the importance of multimodal integration for dexterous manipulation and sets a new benchmark for physically intelligent robotic control. Code and data will be released at https://sites.google.com/view/forcevla2025.
Related papers
- LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies [54.150202739999806]
LiLo-VLA is a modular framework capable of zero-shot modularity to novel long-horizon tasks without ever being trained on them.<n>We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.<n>In these simulations, LiLo-VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%.
arXiv Detail & Related papers (2026-02-25T03:33:39Z) - FD-VLA: Force-Distilled Vision-Language-Action Model for Contact-Rich Manipulation [8.726448573057725]
We present Force-Distilled VLA, a novel framework that integrates force awareness into contact-rich manipulation.<n>The core of our approach is a Force Distillation Module (FDM), which distills force by mapping a learnable query token.<n>During inference, this distilled force token is injected into the pretrained VLM, enabling force-aware reasoning.
arXiv Detail & Related papers (2026-02-02T14:19:46Z) - dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought [66.78110237549087]
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics.<n>We introduce dVLA, a diffusion-based VLA that unifies visual perception, language reasoning, and robotic control in a single system.
arXiv Detail & Related papers (2025-09-30T02:36:11Z) - InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation [43.83789393525928]
InstructVLA is an end-to-end vision-language model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance.<n>InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation.<n>On in-domain SimplerEnv tasks, InstructVLA achieves 30.5% improvement over SpatialVLA.
arXiv Detail & Related papers (2025-07-23T13:57:06Z) - VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback [21.08021535027628]
We present VLA-Touch, an approach that enhances generalist robot policies with tactile sensing.<n>Our method introduces two key innovations: (1) a pipeline that leverages a pretrained tactile-language model that provides semantic tactile feedback for high-level task planning, and (2) a diffusion-based controller that refines VLA-generated actions with tactile signals for contact-rich manipulation.
arXiv Detail & Related papers (2025-07-23T07:54:10Z) - ForceGrip: Reference-Free Curriculum Learning for Realistic Grip Force Control in VR Hand Manipulation [0.10995326465245926]
We present ForceGrip, a deep learning agent that synthesizes realistic hand manipulation motions.<n>We employ a three-phase curriculum learning framework comprising Finger Positioning, Intention Adaptation, and Dynamic Stabilization.<n>Our evaluations reveal ForceGrip's superior force controllability and plausibility compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-03-11T05:39:07Z) - ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model [21.844214660424175]
ChatVLA is a novel framework featuring Phased Alignment Training, which incrementally integrates multimodal data after initial control mastery, and a Mixture-of-Experts architecture to minimize task interference.<n>ChatVLA demonstrates competitive performance on visual question-answering datasets and significantly surpasses state-of-the-art vision-language-action (VLA) methods on multimodal understanding benchmarks.<n>Our findings highlight the potential of our unified framework for achieving both robust multimodal understanding and effective robot control.
arXiv Detail & Related papers (2025-02-20T10:16:18Z) - TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies [95.30717188630432]
We introduce visual trace prompting to facilitate VLA models' spatial-temporal awareness for action prediction.<n>We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories.<n>We present a compact VLA model based on 4B Phi-3-Vision, pretrained on the Open-X-Embodiment and finetuned on our dataset.
arXiv Detail & Related papers (2024-12-13T18:40:51Z) - Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot or task specific training, GVL can in-context zero-shot and few-shot predict effective values for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z) - Robotic Control via Embodied Chain-of-Thought Reasoning [86.6680905262442]
Key limitation of learned robot control policies is their inability to generalize outside their training data.<n>Recent works on vision-language-action models (VLAs) have shown that the use of large, internet pre-trained vision-language models can substantially improve their robustness and generalization ability.<n>We introduce Embodied Chain-of-Thought Reasoning (ECoT) for VLAs, in which we train VLAs to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features before predicting the robot action.
arXiv Detail & Related papers (2024-07-11T17:31:01Z) - LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.<n>First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.<n>We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z) - OpenVLA: An Open-Source Vision-Language-Action Model [131.74098076670103]
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate.
We release model checkpoints, fine-tuning notebooks, and our PyTorch with built-in support for training VLAs at scale on Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-06-13T15:46:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.