VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback
- URL: http://arxiv.org/abs/2507.17294v2
- Date: Tue, 29 Jul 2025 12:31:26 GMT
- Title: VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback
- Authors: Jianxin Bi, Kevin Yuchen Ma, Ce Hao, Mike Zheng Shou, Harold Soh
- Abstract summary: We present VLA-Touch, an approach that enhances generalist robot policies with tactile sensing. Our method introduces two key innovations: (1) a pipeline that leverages a pretrained tactile-language model that provides semantic tactile feedback for high-level task planning, and (2) a diffusion-based controller that refines VLA-generated actions with tactile signals for contact-rich manipulation.
- Score: 21.08021535027628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tactile feedback is generally recognized to be crucial for effective interaction with the physical world. However, state-of-the-art Vision-Language-Action (VLA) models lack the ability to interpret and use tactile signals, limiting their effectiveness in contact-rich tasks. Incorporating tactile feedback into these systems is challenging due to the absence of large multi-modal datasets. We present VLA-Touch, an approach that enhances generalist robot policies with tactile sensing without fine-tuning the base VLA. Our method introduces two key innovations: (1) a pipeline that leverages a pretrained tactile-language model that provides semantic tactile feedback for high-level task planning, and (2) a diffusion-based controller that refines VLA-generated actions with tactile signals for contact-rich manipulation. Through real-world experiments, we demonstrate that our dual-level integration of tactile feedback improves task planning efficiency while enhancing execution precision. Code is open-sourced at https://github.com/jxbi1010/VLA-Touch.
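The dual-level design can be pictured as a small control loop: a pretrained tactile-language model turns raw touch into language the planner can reason over, while a diffusion-based controller refines the frozen VLA's action chunk using the same tactile stream. The sketch below is purely illustrative; all interfaces (tlm, planner, base_vla, refiner) are hypothetical placeholders, not the released VLA-Touch API.

```python
# Illustrative control loop for dual-level tactile feedback.
# All interfaces here are hypothetical placeholders, not the released VLA-Touch code.

def dual_level_step(planner, tlm, base_vla, refiner, camera, tactile_sensor, task):
    rgb = camera.read()              # visual observation
    tactile = tactile_sensor.read()  # raw tactile signal (e.g., a tactile image)

    # Level 1: semantic tactile feedback for high-level planning.
    # The pretrained tactile-language model summarizes touch ("soft", "slipping", ...),
    # and the planner uses that description to choose or revise the next instruction.
    touch_description = tlm.describe(tactile)
    instruction = planner.plan(task, rgb, touch_description)

    # Level 2: tactile-informed action refinement.
    # The base VLA (kept frozen, no fine-tuning) proposes an action chunk from
    # vision + language; a diffusion-based controller refines it with tactile signals.
    coarse_actions = base_vla.predict_actions(rgb, instruction)
    refined_actions = refiner.refine(coarse_actions, tactile, rgb)
    return refined_actions
```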
Related papers
- Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization [14.189391793395384]
This paper introduces Tactile-VLA, a novel framework that deeply fuses vision, language, action, and tactile sensing. Experiments demonstrate Tactile-VLA's effectiveness and generalizability in three key aspects.
arXiv Detail & Related papers (2025-07-12T06:44:37Z)
- ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation [54.28635581240747]
Vision-Language-Action (VLA) models have advanced general-purpose robotic manipulation by leveraging pretrained visual and linguistic representations. ForceVLA treats external force sensing as a first-class modality within VLA systems. Our approach highlights the importance of multimodal integration for dexterous manipulation and sets a new benchmark for physically intelligent robotic control.
arXiv Detail & Related papers (2025-05-28T09:24:25Z)
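As a purely illustrative aside (not ForceVLA's released implementation), a force-aware mixture-of-experts layer can be thought of as routing tokens to experts based on force/torque readings in addition to the token content. A minimal PyTorch sketch under that assumption:

```python
import torch
import torch.nn as nn

class ForceAwareMoE(nn.Module):
    """Toy force-aware mixture-of-experts block: expert routing depends on a
    projected force/torque reading as well as the token features."""

    def __init__(self, dim: int, force_dim: int, num_experts: int = 4):
        super().__init__()
        self.force_proj = nn.Linear(force_dim, dim)
        self.gate = nn.Linear(2 * dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, tokens: torch.Tensor, force: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, D) fused vision-language features; force: (B, force_dim)
        f = self.force_proj(force).unsqueeze(1).expand_as(tokens)           # (B, T, D)
        weights = torch.softmax(self.gate(torch.cat([tokens, f], -1)), -1)  # (B, T, E)
        expert_out = torch.stack([e(tokens) for e in self.experts], -1)     # (B, T, D, E)
        return torch.einsum("btde,bte->btd", expert_out, weights)
```

Because the gate sees both the token features and a projected force embedding, expert routing can shift as contact conditions change.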
- TLA: Tactile-Language-Action Model for Contact-Rich Manipulation [9.97307182748107]
We introduce the Tactile-Language-Action model, which processes sequential tactile feedback via cross-modal language grounding. We construct a comprehensive dataset that contains 24k pairs of tactile action instruction data, customized for fingertip peg-in-hole assembly. Results show that TLA significantly outperforms traditional imitation learning methods in terms of effective action generation and action accuracy.
arXiv Detail & Related papers (2025-03-11T15:36:28Z)
- Towards Generalization of Tactile Image Generation: Reference-Free Evaluation in a Leakage-Free Setting [25.355424080824996]
Tactile sensing is critical for human perception and underpins applications in computer vision, robotics, and multimodal learning. Because tactile data is often scarce and costly to acquire, generating synthetic tactile images provides a scalable solution to augment real-world measurements. We demonstrate that overlapping training and test samples in commonly used datasets inflate performance metrics, obscuring the true generalizability of tactile models.
arXiv Detail & Related papers (2025-03-10T02:37:22Z)
- Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation [58.95799126311524]
Humans can accomplish contact-rich tasks using vision and touch, with highly reactive capabilities such as fast response to external changes and adaptive control of contact forces. Existing visual imitation learning approaches rely on action chunking to model complex behaviors. We introduce TactAR, a low-cost teleoperation system that provides real-time tactile feedback through Augmented Reality.
arXiv Detail & Related papers (2025-03-04T18:58:21Z)
- OpenVLA: An Open-Source Vision-Language-Action Model [131.74098076670103]
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate.
We release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-06-13T15:46:55Z)
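OpenVLA's checkpoints are distributed on Hugging Face, so inference roughly follows the standard transformers remote-code pattern. The snippet below is a sketch of that interface; the exact prompt template and unnorm_key (which selects the action de-normalization statistics for a given robot setup) should be verified against the OpenVLA repository.

```python
# Sketch of OpenVLA inference via Hugging Face transformers; verify the prompt
# template and unnorm_key against the OpenVLA repository before use.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda:0")

image = Image.open("observation.png")  # placeholder for the current camera frame
prompt = "In: What action should the robot take to pick up the red block?\nOut:"

inputs = processor(prompt, image).to("cuda:0", dtype=torch.bfloat16)
# predict_action returns an end-effector action; "bridge_orig" is assumed here to
# pick the BridgeData de-normalization statistics (check for your robot setup).
action = vla.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)
```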
- Learning Visuotactile Skills with Two Multifingered Hands [80.99370364907278]
We explore learning from human demonstrations using a bimanual system with multifingered hands and visuotactile data.
Our results mark a promising step forward in bimanual multifingered manipulation from visuotactile data.
arXiv Detail & Related papers (2024-04-25T17:59:41Z)
- Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset [50.09271028495819]
Existing multimodal research related to touch focuses primarily on visual and tactile modalities.
We construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration.
arXiv Detail & Related papers (2024-03-14T19:01:54Z)
- ViT-Lens: Towards Omni-modal Representations [64.66508684336614]
ViT-Lens-2 is a framework for representation learning across a growing set of modalities.
We show that ViT-Lens-2 can learn representations for 3D point cloud, depth, audio, tactile and EEG.
By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation.
arXiv Detail & Related papers (2023-11-27T18:52:09Z)