Touch begins where vision ends: Generalizable policies for contact-rich manipulation
- URL: http://arxiv.org/abs/2506.13762v1
- Date: Mon, 16 Jun 2025 17:59:48 GMT
- Title: Touch begins where vision ends: Generalizable policies for contact-rich manipulation
- Authors: Zifan Zhao, Siddhant Haldar, Jinda Cui, Lerrel Pinto, Raunaq Bhirangi
- Abstract summary: We introduce VisuoTactile Local (ViTaL) policy learning, a framework that solves fine-grained manipulation tasks. ViTaL achieves around 90% success on contact-rich tasks in unseen environments.
- Score: 18.195865256382334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data-driven approaches struggle with precise manipulation; imitation learning requires many hard-to-obtain demonstrations, while reinforcement learning yields brittle, non-generalizable policies. We introduce VisuoTactile Local (ViTaL) policy learning, a framework that solves fine-grained manipulation tasks by decomposing them into two phases: a reaching phase, where a vision-language model (VLM) enables scene-level reasoning to localize the object of interest, and a local interaction phase, where a reusable, scene-agnostic ViTaL policy performs contact-rich manipulation using egocentric vision and tactile sensing. This approach is motivated by the observation that while scene context varies, the low-level interaction remains consistent across task instances. Local policies trained once in a canonical setting can therefore generalize to new scenes via a localize-then-execute strategy. ViTaL achieves around 90% success on contact-rich tasks in unseen environments and is robust to distractors. ViTaL's effectiveness stems from three key insights: (1) foundation models for segmentation enable training robust visual encoders via behavior cloning; (2) these encoders improve the generalizability of policies learned using residual RL; and (3) tactile sensing significantly boosts performance in contact-rich tasks. Ablation studies validate each of these insights, and we demonstrate that ViTaL integrates well with high-level VLMs, enabling robust, reusable low-level skills. Results and videos are available at https://vitalprecise.github.io.
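The two-phase decomposition in the abstract can be summarized with a minimal, hedged Python sketch. Everything named below (the `SceneVLM` and `LocalPolicy` interfaces, the `robot` object and its methods, and `ViTaLController` itself) is a hypothetical placeholder standing in for the components the abstract describes, not the authors' implementation; the point is only that scene-level context is consumed during reaching, while the local policy sees egocentric vision, touch, and proprioception alone.

```python
"""Minimal sketch of ViTaL-style localize-then-execute control.

All names here (SceneVLM, LocalPolicy, ViTaLController, the robot API) are
hypothetical placeholders for the components the abstract describes; this is
an illustration of the two-phase structure, not the authors' code.
"""
from dataclasses import dataclass
from typing import Protocol

import numpy as np


class SceneVLM(Protocol):
    """Scene-level reasoner: maps a scene image + instruction to a coarse goal pose."""

    def localize(self, scene_rgb: np.ndarray, instruction: str) -> np.ndarray: ...


class LocalPolicy(Protocol):
    """Scene-agnostic visuotactile policy, trained once in a canonical setting."""

    def act(self, ego_rgb: np.ndarray, tactile: np.ndarray,
            proprio: np.ndarray) -> np.ndarray: ...


@dataclass
class ViTaLController:
    vlm: SceneVLM
    local_policy: LocalPolicy

    def reach(self, robot, scene_rgb: np.ndarray, instruction: str) -> None:
        # Phase 1: scene-level reasoning localizes the object of interest and
        # the arm is driven to a pre-contact pose near it.
        goal_pose = self.vlm.localize(scene_rgb, instruction)
        robot.move_to(goal_pose)

    def interact(self, robot, steps: int = 200) -> None:
        # Phase 2: closed-loop contact-rich control from egocentric vision,
        # tactile sensing, and proprioception only -- no scene-level context.
        for _ in range(steps):
            ego_rgb, tactile, proprio = robot.observe()
            action = self.local_policy.act(ego_rgb, tactile, proprio)
            robot.step(action)

    def run(self, robot, scene_rgb: np.ndarray, instruction: str) -> None:
        self.reach(robot, scene_rgb, instruction)
        self.interact(robot)
```

Because only `reach` depends on the scene, the same `LocalPolicy` can in principle be reused across scenes via this localize-then-execute structure, which is the reuse property the abstract emphasizes.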
Related papers
- Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting [70.83781268763215]
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems.
arXiv Detail & Related papers (2025-08-06T09:03:10Z)
- VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback [21.08021535027628]
We present VLA-Touch, an approach that enhances generalist robot policies with tactile sensing. Our method introduces two key innovations: (1) a pipeline that leverages a pretrained tactile-language model to provide semantic tactile feedback for high-level task planning, and (2) a diffusion-based controller that refines VLA-generated actions with tactile signals for contact-rich manipulation.
arXiv Detail & Related papers (2025-07-23T07:54:10Z)
- Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization [14.189391793395384]
This paper introduces Tactile-VLA, a novel framework that deeply fuses vision, language, action, and tactile sensing. Experiments demonstrate Tactile-VLA's effectiveness and generalizability in three key aspects.
arXiv Detail & Related papers (2025-07-12T06:44:37Z)
- ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers [7.505873965164197]
We introduce ViTaPEs, a framework to learn task-agnostic representations for visuotactile perception. Our approach exploits a novel multi-scale positional encoding scheme to capture intra-modal structures. We show that ViTaPEs surpasses state-of-the-art baselines across various recognition tasks.
arXiv Detail & Related papers (2025-05-26T14:19:29Z)
- Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning [27.511627003202538]
Traditional scene graphs primarily focus on spatial relationships, limiting vision-language models' (VLMs) ability to reason about complex interactions in visual scenes. This paper addresses two key challenges: (1) conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and (2) existing approaches fail to form persistent memories for generalizing interaction reasoning to new scenes. We propose Interaction-augmented Scene Graph Reasoning (ISGR), a framework that enhances VLMs' interactional reasoning through three complementary components.
arXiv Detail & Related papers (2025-05-14T04:04:23Z)
- Vision Language Models are In-Context Value Learners [89.29486557646624]
We present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress.
Without any robot- or task-specific training, GVL can predict effective values in context, zero-shot and few-shot, for more than 300 distinct real-world tasks (a minimal sketch of this idea appears after the related-papers list).
arXiv Detail & Related papers (2024-11-07T09:17:50Z)
- Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- ViT-Lens: Towards Omni-modal Representations [64.66508684336614]
ViT-Lens-2 is a framework for representation learning across a growing set of modalities.
We show that ViT-Lens-2 can learn representations for 3D point cloud, depth, audio, tactile and EEG.
By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation.
arXiv Detail & Related papers (2023-11-27T18:52:09Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- GLIPv2: Unifying Localization and Vision-Language Understanding [161.1770269829139]
We present GLIPv2, a grounded Vision-Language (VL) understanding model that serves both localization tasks and VL understanding tasks.
GLIPv2 unifies localization pre-training and Vision-Language Pre-training with three pre-training tasks.
We show that a single GLIPv2 model achieves near SoTA performance on various localization and understanding tasks.
arXiv Detail & Related papers (2022-06-12T20:31:28Z)
- Learning Object Relation Graph and Tentative Policy for Visual Navigation [44.247995617796484]
It is critical to learn informative visual representations and a robust navigation policy.
This paper proposes three complementary techniques: an object relation graph (ORG), trial-driven imitation learning (IL), and a memory-augmented tentative policy network (TPN).
We report 22.8% and 23.5% increases in success rate and Success weighted by Path Length (SPL), respectively.
arXiv Detail & Related papers (2020-07-21T18:03:05Z)
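The in-context value learning idea from the GVL entry above is concrete enough for a short, hedged sketch. In the Python snippet below, `query_vlm` is a hypothetical stand-in for whatever multimodal API is available (it is not GVL's interface), the prompt wording is illustrative, and the frame-shuffling step is included only so the model must judge progress from image content rather than temporal order; the abstract snippet above does not specify whether the paper does this.

```python
"""Hedged sketch of in-context value estimation with a VLM (GVL-style).

`query_vlm` is a hypothetical stand-in for any multimodal API that accepts
images plus text and returns text; the prompt format is illustrative only.
"""
from typing import Callable, List, Sequence

import numpy as np

# (frames, prompt) -> free-form text reply
VLMQuery = Callable[[Sequence[np.ndarray], str], str]


def estimate_task_progress(
    frames: List[np.ndarray],
    task: str,
    query_vlm: VLMQuery,
    rng: np.random.Generator,
) -> List[float]:
    """Ask a VLM to score per-frame task completion in [0, 1], with no training."""
    # Shuffle frames so the model must judge progress from image content
    # rather than from the temporal order of the sequence.
    order = rng.permutation(len(frames))
    prompt = (
        f"The task is: {task}. For each image, answer with one number per "
        "line: the estimated percentage of task completion (0 to 100)."
    )
    reply = query_vlm([frames[i] for i in order], prompt)
    scores = [
        float(line.strip().rstrip("%")) / 100.0
        for line in reply.splitlines()
        if line.strip()
    ]
    # Undo the shuffle so values line up with the original frame order.
    values = [0.0] * len(frames)
    for frame_idx, score in zip(order, scores):
        values[frame_idx] = score
    return values
```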