VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback
- URL: http://arxiv.org/abs/2507.17294v2
- Date: Tue, 29 Jul 2025 12:31:26 GMT
- Title: VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback
- Authors: Jianxin Bi, Kevin Yuchen Ma, Ce Hao, Mike Zheng Shou, Harold Soh
- Abstract summary: We present VLA-Touch, an approach that enhances generalist robot policies with tactile sensing. Our method introduces two key innovations: (1) a pipeline that leverages a pretrained tactile-language model that provides semantic tactile feedback for high-level task planning, and (2) a diffusion-based controller that refines VLA-generated actions with tactile signals for contact-rich manipulation.
- Score: 21.08021535027628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tactile feedback is generally recognized to be crucial for effective interaction with the physical world. However, state-of-the-art Vision-Language-Action (VLA) models lack the ability to interpret and use tactile signals, limiting their effectiveness in contact-rich tasks. Incorporating tactile feedback into these systems is challenging due to the absence of large multi-modal datasets. We present VLA-Touch, an approach that enhances generalist robot policies with tactile sensing without fine-tuning the base VLA. Our method introduces two key innovations: (1) a pipeline that leverages a pretrained tactile-language model that provides semantic tactile feedback for high-level task planning, and (2) a diffusion-based controller that refines VLA-generated actions with tactile signals for contact-rich manipulation. Through real-world experiments, we demonstrate that our dual-level integration of tactile feedback improves task planning efficiency while enhancing execution precision. Code is open-sourced at https://github.com/jxbi1010/VLA-Touch.
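The dual-level design can be pictured as a small control loop: a pretrained tactile-language model turns raw touch into language the planner can reason over, while a diffusion-based controller refines the frozen VLA's action chunk using the same tactile stream. The sketch below is purely illustrative; all interfaces (tlm, planner, base_vla, refiner) are hypothetical placeholders, not the released VLA-Touch API.

```python
# Illustrative control loop for dual-level tactile feedback.
# All interfaces here are hypothetical placeholders, not the released VLA-Touch code.

def dual_level_step(planner, tlm, base_vla, refiner, camera, tactile_sensor, task):
    rgb = camera.read()              # visual observation
    tactile = tactile_sensor.read()  # raw tactile signal (e.g., a tactile image)

    # Level 1: semantic tactile feedback for high-level planning.
    # The pretrained tactile-language model summarizes touch ("soft", "slipping", ...),
    # and the planner uses that description to choose or revise the next instruction.
    touch_description = tlm.describe(tactile)
    instruction = planner.plan(task, rgb, touch_description)

    # Level 2: tactile-informed action refinement.
    # The base VLA (kept frozen, no fine-tuning) proposes an action chunk from
    # vision + language; a diffusion-based controller refines it with tactile signals.
    coarse_actions = base_vla.predict_actions(rgb, instruction)
    refined_actions = refiner.refine(coarse_actions, tactile, rgb)
    return refined_actions
```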
Related papers
- Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization [14.189391793395384]
This paper introduces Tactile-VLA, a novel framework that deeply fuses vision, language, action, and tactile sensing. Experiments demonstrate Tactile-VLA's effectiveness and generalizability in three key aspects.
arXiv Detail & Related papers (2025-07-12T06:44:37Z)
- ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation [54.28635581240747]
Vision-Language-Action (VLA) models have advanced general-purpose robotic manipulation by leveraging pretrained visual and linguistic representations. ForceVLA treats external force sensing as a first-class modality within VLA systems. Our approach highlights the importance of multimodal integration for dexterous manipulation and sets a new benchmark for physically intelligent robotic control.
arXiv Detail & Related papers (2025-05-28T09:24:25Z)
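As a purely illustrative aside (not ForceVLA's released implementation), a force-aware mixture-of-experts layer can be thought of as routing tokens to experts based on force/torque readings in addition to the token content. A minimal PyTorch sketch under that assumption:

```python
import torch
import torch.nn as nn

class ForceAwareMoE(nn.Module):
    """Toy force-aware mixture-of-experts block: expert routing depends on a
    projected force/torque reading as well as the token features."""

    def __init__(self, dim: int, force_dim: int, num_experts: int = 4):
        super().__init__()
        self.force_proj = nn.Linear(force_dim, dim)
        self.gate = nn.Linear(2 * dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, tokens: torch.Tensor, force: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, D) fused vision-language features; force: (B, force_dim)
        f = self.force_proj(force).unsqueeze(1).expand_as(tokens)           # (B, T, D)
        weights = torch.softmax(self.gate(torch.cat([tokens, f], -1)), -1)  # (B, T, E)
        expert_out = torch.stack([e(tokens) for e in self.experts], -1)     # (B, T, D, E)
        return torch.einsum("btde,bte->btd", expert_out, weights)
```

Because the gate sees both the token features and a projected force embedding, expert routing can shift as contact conditions change.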
- TLA: Tactile-Language-Action Model for Contact-Rich Manipulation [9.97307182748107]
We introduce the Tactile-Language-Action model, which processes sequential tactile feedback via cross-modal language grounding. We construct a comprehensive dataset that contains 24k pairs of tactile action instruction data, customized for fingertip peg-in-hole assembly. Results show that TLA significantly outperforms traditional imitation learning methods in terms of effective action generation and action accuracy.
arXiv Detail & Related papers (2025-03-11T15:36:28Z)
- Towards Generalization of Tactile Image Generation: Reference-Free Evaluation in a Leakage-Free Setting [25.355424080824996]
Tactile sensing is critical for human perception and underpins applications in computer vision, robotics, and multimodal learning. Because tactile data is often scarce and costly to acquire, generating synthetic tactile images provides a scalable solution to augment real-world measurements. We demonstrate that overlapping training and test samples in commonly used datasets inflate performance metrics, obscuring the true generalizability of tactile models.
arXiv Detail & Related papers (2025-03-10T02:37:22Z)
- Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation [58.95799126311524]
Humans can accomplish contact-rich tasks using vision and touch, with highly reactive capabilities such as fast response to external changes and adaptive control of contact forces. Existing visual imitation learning approaches rely on action chunking to model complex behaviors. We introduce TactAR, a low-cost teleoperation system that provides real-time tactile feedback through Augmented Reality.
arXiv Detail & Related papers (2025-03-04T18:58:21Z)
- OpenVLA: An Open-Source Vision-Language-Action Model [131.74098076670103]
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate.
We release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-06-13T15:46:55Z)
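OpenVLA's checkpoints are distributed on Hugging Face, so inference roughly follows the standard transformers remote-code pattern. The snippet below is a sketch of that interface; the exact prompt template and unnorm_key (which selects the action de-normalization statistics for a given robot setup) should be verified against the OpenVLA repository.

```python
# Sketch of OpenVLA inference via Hugging Face transformers; verify the prompt
# template and unnorm_key against the OpenVLA repository before use.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("observation.png")  # placeholder for the current camera frame
prompt = "In: What action should the robot take to pick up the red block?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# predict_action returns an end-effector action; "bridge_orig" is assumed here to
# pick the BridgeData de-normalization statistics (check for your robot setup).
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
```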
- Learning Visuotactile Skills with Two Multifingered Hands [80.99370364907278]
We explore learning from human demonstrations using a bimanual system with multifingered hands and visuotactile data.
Our results mark a promising step forward in bimanual multifingered manipulation from visuotactile data.
arXiv Detail & Related papers (2024-04-25T17:59:41Z)
- Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset [50.09271028495819]
Existing multimodal research related to touch focuses primarily on visual and tactile modalities.
We construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration.
arXiv Detail & Related papers (2024-03-14T19:01:54Z)
- ViT-Lens: Towards Omni-modal Representations [64.66508684336614]
ViT-Lens-2 is a framework for representation learning across a growing set of modalities.
We show that ViT-Lens-2 can learn representations for 3D point cloud, depth, audio, tactile and EEG.
By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation.
arXiv Detail & Related papers (2023-11-27T18:52:09Z)