FD-VLA: Force-Distilled Vision-Language-Action Model for Contact-Rich Manipulation
- URL: http://arxiv.org/abs/2602.02142v1
- Date: Mon, 02 Feb 2026 14:19:46 GMT
- Title: FD-VLA: Force-Distilled Vision-Language-Action Model for Contact-Rich Manipulation
- Authors: Ruiteng Zhao, Wenshuo Wang, Yicheng Ma, Xiaocong Li, Francis E. H. Tay, Marcelo H. Ang, Haiyue Zhu
- Abstract summary: We present Force-Distilled VLA, a novel framework that integrates force awareness into contact-rich manipulation. The core of our approach is a Force Distillation Module (FDM), which distills force by mapping a learnable query token. During inference, this distilled force token is injected into the pretrained VLM, enabling force-aware reasoning.
- Score: 8.726448573057725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Force sensing is a crucial modality for Vision-Language-Action (VLA) frameworks, as it enables fine-grained perception and dexterous manipulation in contact-rich tasks. We present Force-Distilled VLA (FD-VLA), a novel framework that integrates force awareness into contact-rich manipulation without relying on physical force sensors. The core of our approach is a Force Distillation Module (FDM), which distills force by mapping a learnable query token, conditioned on visual observations and robot states, into a predicted force token aligned with the latent representation of actual force signals. During inference, this distilled force token is injected into the pretrained VLM, enabling force-aware reasoning while preserving the integrity of its vision-language semantics. This design provides two key benefits: first, it allows practical deployment across a wide range of robots that lack expensive or fragile force-torque sensors, thereby reducing hardware cost and complexity; second, the FDM introduces an additional force-vision-state fusion prior to the VLM, which improves cross-modal alignment and enhances perception-action robustness in contact-rich scenarios. Surprisingly, our physical experiments show that the distilled force token outperforms direct sensor force measurements as well as other baselines, which highlights the effectiveness of this force-distilled VLA approach.
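Reading the abstract literally, the FDM amounts to a small cross-attention head in which a learnable query, conditioned on visual and robot-state tokens, is trained to match the latent representation of the measured force signal. Below is a minimal PyTorch sketch under that reading; the module names, dimensions, and the MSE alignment loss are assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForceDistillationModule(nn.Module):
    """Hypothetical reading of the FDM: a learnable query token attends
    over visual and robot-state tokens to produce a predicted force token."""

    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.force_query = nn.Parameter(torch.randn(1, 1, dim))  # learnable query token
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, vision_tokens, state_tokens):
        # Condition the query on visual observations and robot states.
        context = torch.cat([vision_tokens, state_tokens], dim=1)  # (B, N, dim)
        query = self.force_query.expand(context.size(0), -1, -1)
        force_token, _ = self.cross_attn(query, context, context)
        return self.proj(force_token)                              # (B, 1, dim)

def distillation_loss(pred_token, force_encoder, force_signal):
    """Training-time alignment (assumed here to be MSE) with the latent
    representation of the actually measured force signal."""
    with torch.no_grad():
        target = force_encoder(force_signal)  # latent of the real F/T reading
    return F.mse_loss(pred_token, target)
```

At inference the force encoder and the physical sensor would be dropped entirely; only the predicted token is injected into the pretrained VLM's token stream, which is what allows deployment on robots without force-torque sensors.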
Related papers
- Symmetry-Aware Fusion of Vision and Tactile Sensing via Bilateral Force Priors for Robotic Manipulation [7.104060092661104]
We propose a Cross-Modal Transformer (CMT) for visuo-tactile fusion. CMT integrates wrist-camera observations with tactile signals through structured self- and cross-attention. Experiments on the TacSL benchmark show that CMT with symmetry regularization achieves a 96.59% insertion success rate.
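The summary only names structured self- and cross-attention over wrist-camera and tactile inputs; a hedged sketch of one such fusion block follows (layer layout, dimensions, and attention direction are guesses, and the symmetry regularization is omitted):

```python
import torch.nn as nn

class VisuoTactileFusionBlock(nn.Module):
    """One hypothetical CMT-style block: per-modality self-attention,
    then tactile tokens cross-attend to wrist-camera tokens."""

    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        self.vis_self = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.tac_self = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens, tac_tokens):
        vis = vis_tokens + self.vis_self(vis_tokens, vis_tokens, vis_tokens)[0]
        tac = tac_tokens + self.tac_self(tac_tokens, tac_tokens, tac_tokens)[0]
        # Tactile queries attend over visual keys/values (direction assumed).
        fused = tac + self.cross(tac, vis, vis)[0]
        return self.norm(fused)
```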
arXiv Detail & Related papers (2026-02-14T09:19:48Z)
- Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation [14.221542785249524]
We introduce DreamTacVLA, a framework that grounds VLA models in contact physics by learning to feel the future. Our model adopts a hierarchical perception scheme in which high-resolution tactile images serve as micro-vision inputs. To further deepen the model's understanding of fine-grained contact dynamics, we finetune the system with a tactile world model that predicts future tactile signals.
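A world model that "predicts future tactile signals" suggests a sequence model finetuned with a next-step prediction loss; a hedged sketch under that assumption (architecture and loss are entirely guessed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TactileWorldModel(nn.Module):
    """Hypothetical predictor: from past tactile-image embeddings and the
    executed actions, predict the next tactile embedding."""

    def __init__(self, tac_dim=256, act_dim=7, hidden=512):
        super().__init__()
        self.rnn = nn.GRU(tac_dim + act_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, tac_dim)

    def forward(self, tac_seq, act_seq):
        # tac_seq: (B, T, tac_dim) embeddings of high-resolution tactile images
        # act_seq: (B, T, act_dim) actions executed at each step
        h, _ = self.rnn(torch.cat([tac_seq, act_seq], dim=-1))
        return self.head(h)  # (B, T, tac_dim): prediction for step t+1

def world_model_loss(model, tac_seq, act_seq):
    """Finetuning signal (assumed): regress the next-step tactile embedding."""
    pred = model(tac_seq[:, :-1], act_seq[:, :-1])
    return F.mse_loss(pred, tac_seq[:, 1:])
```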
arXiv Detail & Related papers (2025-12-29T21:06:33Z)
- ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning [52.86018040861575]
We propose a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens. Experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines.
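One plausible reading of Structural Slow-Fast Learning is a single causal attention pass over a mixed token stream, where the mask is built from arrival timestamps so that fast force tokens and slow visual tokens each attend only to what has already arrived. A sketch of that mask under this reading, not the authors' code:

```python
import torch
import torch.nn as nn

def async_causal_mask(timestamps):
    """Additive attention mask over a mixed visual+force token stream:
    token i may attend to token j only if j arrived no later than i.
    Force tokens arrive at a higher rate than visual ones, so this
    mask encodes the slow-fast asynchrony directly."""
    allowed = timestamps.unsqueeze(1) >= timestamps.unsqueeze(0)  # (N, N) bool
    mask = torch.zeros(allowed.shape, dtype=torch.float)
    mask[~allowed] = float("-inf")
    return mask

# Usage with a standard attention layer (all dimensions assumed):
attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
tokens = torch.randn(1, 6, 256)  # e.g. 2 slow visual + 4 fast force tokens
times = torch.tensor([0.00, 0.10, 0.02, 0.04, 0.06, 0.08])
out, _ = attn(tokens, tokens, tokens, attn_mask=async_causal_mask(times))
```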
arXiv Detail & Related papers (2025-12-11T18:59:46Z)
- V-Attack: Targeting Disentangled Value Features for Controllable Adversarial Attacks on LVLMs [66.81402538540458]
We propose V-Attack, a novel method for precise local semantic attacks. V-Attack improves the attack success rate by an average of 36% over state-of-the-art methods.
arXiv Detail & Related papers (2025-11-25T11:51:17Z)
- Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation [70.8381970762877]
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in semantic reasoning and task planning. We introduce GRACE, a novel framework that grounds VLM-based reasoning through executable analytic concepts. GRACE provides a unified and interpretable interface between high-level instruction understanding and low-level robot control.
arXiv Detail & Related papers (2025-10-09T09:08:33Z)
- Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization [14.189391793395384]
This paper introduces Tactile-VLA, a novel framework that deeply fuses vision, language, action, and tactile sensing. Experiments demonstrate Tactile-VLA's effectiveness and generalizability in three key aspects.
arXiv Detail & Related papers (2025-07-12T06:44:37Z)
- Feel the Force: Contact-Driven Learning from Humans [52.36160086934298]
Controlling fine-grained forces during manipulation remains a core challenge in robotics. We present FeelTheForce, a robot learning system that models human tactile behavior to learn force-sensitive manipulation. Our approach grounds robust low-level force control in scalable human supervision, achieving a 77% success rate across 5 force-sensitive manipulation tasks.
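The low-level loop implied here tracks a policy-predicted, human-like target force with the gripper; the toy proportional controller below stands in for whatever control law the paper actually uses (gain and sign convention assumed):

```python
def force_tracking_step(grip_width, f_measured, f_target, kp=0.002):
    """Hypothetical low-level force control: close the gripper further
    when the measured contact force falls below the predicted target."""
    error = f_target - f_measured    # positive => grip is too loose
    return grip_width - kp * error   # shrink the width to raise the force
```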
arXiv Detail & Related papers (2025-06-02T17:57:52Z)
- ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation [62.58034332427291]
ForceVLA is a novel end-to-end manipulation framework. It treats external force sensing as a first-class modality within VLA systems.
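Treating force as a first-class modality in a mixture-of-experts could mean routing on the wrench itself; a hedged sketch of a force-conditioned MoE layer (the routing scheme and expert design are assumptions, not ForceVLA's actual architecture):

```python
import torch
import torch.nn as nn

class ForceAwareMoE(nn.Module):
    """Hypothetical force-aware MoE layer: the router is conditioned on
    the external force/torque reading, so contact events can shift
    which expert processes the fused tokens."""

    def __init__(self, dim=512, n_experts=4, ft_dim=6):
        super().__init__()
        self.router = nn.Linear(dim + ft_dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, tokens, wrench):
        # tokens: (B, N, dim); wrench: (B, 6) force/torque reading
        w = wrench.unsqueeze(1).expand(-1, tokens.size(1), -1)
        gate = torch.softmax(self.router(torch.cat([tokens, w], dim=-1)), dim=-1)
        out = torch.stack([e(tokens) for e in self.experts], dim=-1)  # (B,N,dim,E)
        return (out * gate.unsqueeze(2)).sum(-1)                      # (B,N,dim)
```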
arXiv Detail & Related papers (2025-05-28T09:24:25Z)
- ForceGrip: Reference-Free Curriculum Learning for Realistic Grip Force Control in VR Hand Manipulation [0.10995326465245926]
We present ForceGrip, a deep learning agent that synthesizes realistic hand manipulation motions. We employ a three-phase curriculum learning framework comprising Finger Positioning, Intention Adaptation, and Dynamic Stabilization. Our evaluations reveal ForceGrip's superior force controllability and plausibility compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-03-11T05:39:07Z)
- Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation [58.95799126311524]
Humans can accomplish contact-rich tasks using vision and touch, with highly reactive capabilities such as fast response to external changes and adaptive control of contact forces. Existing visual imitation learning approaches rely on action chunking to model complex behaviors. We introduce TactAR, a low-cost teleoperation system that provides real-time tactile feedback through Augmented Reality.
arXiv Detail & Related papers (2025-03-04T18:58:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.