Related papers: ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation

ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation

URL: http://arxiv.org/abs/2505.22159v3
Date: Thu, 18 Sep 2025 15:02:38 GMT
Title: ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation
Authors: Jiawen Yu, Hairuo Liu, Qiaojun Yu, Jieji Ren, Ce Hao, Haitong Ding, Guangyu Huang, Guofan Huang, Yan Song, Panpan Cai, Cewu Lu, Wenqiang Zhang,
Abstract summary: ForceVLA is a novel end-to-end manipulation framework.<n>It treats external force sensing as a first-class modality within VLA systems.
Score: 62.58034332427291
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models have advanced general-purpose robotic manipulation by leveraging pretrained visual and linguistic representations. However, they struggle with contact-rich tasks that require fine-grained control involving force, especially under visual occlusion or dynamic uncertainty. To address these limitations, we propose ForceVLA, a novel end-to-end manipulation framework that treats external force sensing as a first-class modality within VLA systems. ForceVLA introduces FVLMoE, a force-aware Mixture-of-Experts fusion module that dynamically integrates pretrained visual-language embeddings with real-time 6-axis force feedback during action decoding. This enables context-aware routing across modality-specific experts, enhancing the robot's ability to adapt to subtle contact dynamics. We also introduce \textbf{ForceVLA-Data}, a new dataset comprising synchronized vision, proprioception, and force-torque signals across five contact-rich manipulation tasks. ForceVLA improves average task success by 23.2% over strong pi_0-based baselines, achieving up to 80% success in tasks such as plug insertion. Our approach highlights the importance of multimodal integration for dexterous manipulation and sets a new benchmark for physically intelligent robotic control. Code and data will be released at https://sites.google.com/view/forcevla2025.

Related papers

LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies [54.150202739999806]
LiLo-VLA is a modular framework capable of zero-shot modularity to novel long-horizon tasks without ever being trained on them.<n>We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.<n>In these simulations, LiLo-VLA achieves a 69% average success rate, outperforming Pi0.5 by 41% and OpenVLA-OFT by 67%.
arXiv Detail & Related papers (2026-02-25T03:33:39Z)
FD-VLA: Force-Distilled Vision-Language-Action Model for Contact-Rich Manipulation [8.726448573057725]
We present Force-Distilled VLA, a novel framework that integrates force awareness into contact-rich manipulation.<n>The core of our approach is a Force Distillation Module (FDM), which distills force by mapping a learnable query token.<n>During inference, this distilled force token is injected into the pretrained VLM, enabling force-aware reasoning.
arXiv Detail & Related papers (2026-02-02T14:19:46Z)
dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought [66.78110237549087]
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics.<n>We introduce dVLA, a diffusion-based VLA that unifies visual perception, language reasoning, and robotic control in a single system.
arXiv Detail & Related papers (2025-09-30T02:36:11Z)
InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation [43.83789393525928]
InstructVLA is an end-to-end vision-language model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance.<n>InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation.<n>On in-domain SimplerEnv tasks, InstructVLA achieves 30.5% improvement over SpatialVLA.
arXiv Detail & Related papers (2025-07-23T13:57:06Z)
VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback [21.08021535027628]
We present VLA-Touch, an approach that enhances generalist robot policies with tactile sensing.<n>Our method introduces two key innovations: (1) a pipeline that leverages a pretrained tactile-language model that provides semantic tactile feedback for high-level task planning, and (2) a diffusion-based controller that refines VLA-generated actions with tactile signals for contact-rich manipulation.
arXiv Detail & Related papers (2025-07-23T07:54:10Z)
ForceGrip: Reference-Free Curriculum Learning for Realistic Grip Force Control in VR Hand Manipulation [0.10995326465245926]
We present ForceGrip, a deep learning agent that synthesizes realistic hand manipulation motions.<n>We employ a three-phase curriculum learning framework comprising Finger Positioning, Intention Adaptation, and Dynamic Stabilization.<n>Our evaluations reveal ForceGrip's superior force controllability and plausibility compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-03-11T05:39:07Z)
ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model [21.844214660424175]
ChatVLA is a novel framework featuring Phased Alignment Training, which incrementally integrates multimodal data after initial control mastery, and a Mixture-of-Experts architecture to minimize task interference.<n>ChatVLA demonstrates competitive performance on visual question-answering datasets and significantly surpasses state-of-the-art vision-language-action (VLA) methods on multimodal understanding benchmarks.<n>Our findings highlight the potential of our unified framework for achieving both robust multimodal understanding and effective robot control.
arXiv Detail & Related papers (2025-02-20T10:16:18Z)
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies [95.30717188630432]
We introduce visual trace prompting to facilitate VLA models' spatial-temporal awareness for action prediction.<n>We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories.<n>We present a compact VLA model based on 4B Phi-3-Vision, pretrained on the Open-X-Embodiment and finetuned on our dataset.
arXiv Detail & Related papers (2024-12-13T18:40:51Z)
Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress. Without any robot or task specific training, GVL can in-context zero-shot and few-shot predict effective values for more than 300 distinct real-world tasks.
arXiv Detail & Related papers (2024-11-07T09:17:50Z)
Robotic Control via Embodied Chain-of-Thought Reasoning [86.6680905262442]
Key limitation of learned robot control policies is their inability to generalize outside their training data.<n>Recent works on vision-language-action models (VLAs) have shown that the use of large, internet pre-trained vision-language models can substantially improve their robustness and generalization ability.<n>We introduce Embodied Chain-of-Thought Reasoning (ECoT) for VLAs, in which we train VLAs to perform multiple steps of reasoning about plans, sub-tasks, motions, and visually grounded features before predicting the robot action.
arXiv Detail & Related papers (2024-07-11T17:31:01Z)
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.<n>First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.<n>We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
OpenVLA: An Open-Source Vision-Language-Action Model [131.74098076670103]
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate. We release model checkpoints, fine-tuning notebooks, and our PyTorch with built-in support for training VLAs at scale on Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-06-13T15:46:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.