Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation
- URL: http://arxiv.org/abs/2512.09851v1
- Date: Wed, 10 Dec 2025 17:35:13 GMT
- Title: Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation
- Authors: Yuyang Li, Yinghan Chen, Zihang Zhao, Puhao Li, Tengyu Liu, Siyuan Huang, Yixin Zhu
- Abstract summary: See-through-skin (STS) sensors combine tactile and visual perception. Existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction.
- Score: 21.78866976181311
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robotic manipulation requires both rich multimodal perception and effective learning frameworks to handle complex real-world tasks. See-through-skin (STS) sensors, which combine tactile and visual perception, offer promising sensing capabilities, while modern imitation learning provides powerful tools for policy acquisition. However, existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. Furthermore, integrating these rich multimodal signals into learning-based manipulation pipelines remains an open challenge. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction, and TacThru-UMI, an imitation learning framework that leverages these multimodal signals for manipulation. Our sensor features a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking, while our learning system integrates these signals through a Transformer-based Diffusion Policy. Experiments on five challenging real-world tasks show that TacThru-UMI achieves an average success rate of 85.5%, significantly outperforming the baselines of alternating tactile-visual (66.3%) and vision-only (55.4%). The system excels in critical scenarios, including contact detection with thin and soft objects and precision manipulation requiring multimodal coordination. This work demonstrates that combining simultaneous multimodal perception with modern learning frameworks enables more precise, adaptable robotic manipulation.
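As a rough illustration of the learning side of the abstract, the sketch below encodes a paired visual and tactile frame, as an STS sensor would deliver simultaneously, into one conditioning vector for a Transformer-style policy trunk. It is a minimal PyTorch sketch with assumed module names, backbone sizes, and dimensions; it is not the authors' TacThru-UMI implementation, and the diffusion action head itself is omitted.

```python
# Minimal sketch (assumed architecture, not the TacThru-UMI code): fuse a
# simultaneous visual/tactile frame pair into one conditioning vector that a
# Transformer-based diffusion policy head could consume.
import torch
import torch.nn as nn


class MultimodalObsEncoder(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        # Tiny CNN stems for illustration; a real system would likely use
        # pretrained vision backbones for both streams.
        self.visual_stem = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d_model))
        self.tactile_stem = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d_model))
        self.fuser = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)

    def forward(self, visual: torch.Tensor, tactile: torch.Tensor) -> torch.Tensor:
        # visual, tactile: (B, 3, H, W) frames captured at the same instant.
        tokens = torch.stack(
            [self.visual_stem(visual), self.tactile_stem(tactile)], dim=1)
        return self.fuser(tokens).mean(dim=1)  # (B, d_model) conditioning vector


if __name__ == "__main__":
    enc = MultimodalObsEncoder()
    cond = enc(torch.rand(2, 3, 96, 96), torch.rand(2, 3, 96, 96))
    print(cond.shape)  # torch.Size([2, 256]); would condition the action denoiser
```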
Related papers
- Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning [10.782934021703783]
MultiSensory Dynamic Pretraining (MSDP) is a framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings.
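As a sketch of that masked-autoencoding idea only (invented sensor count, dimensions, and masking ratio; not the MSDP code): each sensor's observation is assumed to already be embedded, a random subset is hidden behind a mask token, and the transformer is trained to reconstruct the hidden embeddings.

```python
# Illustrative sketch of masked autoencoding over per-sensor embeddings
# (assumed shapes; not the MSDP implementation).
import torch
import torch.nn as nn

d, num_sensors = 128, 4  # e.g., RGB, depth, tactile, proprioception (assumed)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
decoder = nn.Linear(d, d)                      # predicts the hidden embeddings
mask_token = nn.Parameter(torch.zeros(1, 1, d))

sensor_emb = torch.randn(8, num_sensors, d)    # (batch, sensors, dim)
hidden = torch.rand(8, num_sensors) < 0.5      # True = masked out of the input
inputs = torch.where(hidden.unsqueeze(-1),
                     mask_token.expand_as(sensor_emb), sensor_emb)
recon = decoder(encoder(inputs))
loss = ((recon - sensor_emb) ** 2)[hidden].mean()  # reconstruct masked slots only
loss.backward()
```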
arXiv Detail & Related papers (2025-11-18T12:32:23Z)
- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper [7.618517580705364]
We present a portable, lightweight gripper with integrated tactile sensors. We propose a cross-modal representation learning framework that integrates visual and tactile signals. We validate our approach on fine-grained tasks such as test tube insertion and pipette-based fluid transfer.
arXiv Detail & Related papers (2025-07-20T17:53:59Z)
- Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation [58.95799126311524]
Humans can accomplish contact-rich tasks using vision and touch, with highly reactive capabilities such as fast response to external changes and adaptive control of contact forces. Existing visual imitation learning approaches rely on action chunking to model complex behaviors. We introduce TactAR, a low-cost teleoperation system that provides real-time tactile feedback through Augmented Reality.
arXiv Detail & Related papers (2025-03-04T18:58:21Z)
- AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors [11.506370451126378]
Visuo-tactile sensors aim to emulate human tactile perception, enabling robots to understand and manipulate objects. We introduce TacQuad, an aligned multi-modal tactile multi-sensor dataset from four different visuo-tactile sensors. We propose AnyTouch, a unified static-dynamic multi-sensor representation learning framework with a multi-level structure.
arXiv Detail & Related papers (2025-02-15T08:33:25Z)
- Learning Precise, Contact-Rich Manipulation through Uncalibrated Tactile Skins [17.412763585521688]
We present the Visuo-Skin (ViSk) framework, a simple approach that uses a transformer-based policy and treats skin sensor data as additional tokens alongside visual information.
ViSk significantly outperforms both vision-only and optical tactile sensing based policies.
Further analysis reveals that combining tactile and visual modalities enhances policy performance and spatial generalization, achieving an average improvement of 27.5% across tasks.
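A minimal sketch of that token-level fusion (invented patch count, taxel count, and action dimension; not the ViSk codebase): each skin reading is projected to the visual token width and appended to the image tokens before a shared transformer trunk.

```python
# Rough sketch: skin-sensor readings as extra tokens next to visual patch
# tokens in one transformer policy trunk (assumed shapes, not the ViSk code).
import torch
import torch.nn as nn

d = 256
visual_tokens = torch.randn(1, 196, d)   # e.g., 14x14 patch features (assumed)
skin_readings = torch.randn(1, 32, 5)    # 32 skin taxels, 5 raw values each (assumed)
skin_proj = nn.Linear(5, d)              # every taxel becomes one extra token

trunk = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=4)
action_head = nn.Linear(d, 7)            # e.g., a 7-DoF end-effector action (assumed)

tokens = torch.cat([visual_tokens, skin_proj(skin_readings)], dim=1)
action = action_head(trunk(tokens).mean(dim=1))
print(action.shape)  # torch.Size([1, 7])
```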
arXiv Detail & Related papers (2024-10-22T17:59:49Z)
- Learning Visuotactile Skills with Two Multifingered Hands [80.99370364907278]
We explore learning from human demonstrations using a bimanual system with multifingered hands and visuotactile data.
Our results mark a promising step forward in bimanual multifingered manipulation from visuotactile data.
arXiv Detail & Related papers (2024-04-25T17:59:41Z)
- Multimodal and Force-Matched Imitation Learning with a See-Through Visuotactile Sensor [14.492202828369127]
We leverage a multimodal visuotactile sensor within the framework of imitation learning (IL) to perform contact-rich tasks. We introduce two algorithmic contributions, tactile force matching and learned mode switching, as complementary methods for improving IL. Our results show that the inclusion of force matching raises average policy success rates by 62.5%, visuotactile mode switching by 30.3%, and visuotactile data as a policy input by 42.5%.
arXiv Detail & Related papers (2023-11-02T14:02:42Z)
- The Power of the Senses: Generalizable Manipulation from Vision and Touch through Masked Multimodal Learning [60.91637862768949]
We propose Masked Multimodal Learning (M3L) to fuse visual and tactile information in a reinforcement learning setting.
M3L learns a policy and visual-tactile representations based on masked autoencoding.
We evaluate M3L on three simulated environments with both visual and tactile observations.
arXiv Detail & Related papers (2023-11-02T01:33:00Z)
- Dexterous Manipulation from Images: Autonomous Real-World RL via Substep Guidance [71.36749876465618]
We describe a system for vision-based dexterous manipulation that provides a "programming-free" approach for users to define new tasks.
Our system includes a framework for users to define a final task and intermediate sub-tasks with image examples.
We show experimental results with a four-finger robotic hand learning multi-stage object manipulation tasks directly in the real world.
arXiv Detail & Related papers (2022-12-19T22:50:40Z)
- See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation [49.925499720323806]
We study how visual, auditory, and tactile perception can jointly help robots to solve complex manipulation tasks.
We build a robot system that can see with a camera, hear with a contact microphone, and feel with a vision-based tactile sensor.
arXiv Detail & Related papers (2022-12-07T18:55:53Z)
- Learning to Detect Slip with Barometric Tactile Sensors and a Temporal Convolutional Neural Network [7.346580429118843]
We present a learning-based method to detect slip using barometric tactile sensors.
We train a temporal convolutional neural network to detect slip, achieving high detection accuracies.
We argue that barometric tactile sensing technology, combined with data-driven learning, is suitable for many manipulation tasks such as slip compensation.
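For intuition, a dilated 1-D convolution stack of the kind described here might look like the sketch below; the taxel count, window length, and layer widths are assumptions, not the paper's architecture.

```python
# Minimal sketch of a temporal convolutional slip classifier over a window of
# barometric taxel readings (assumed shapes, not the paper's network).
import torch
import torch.nn as nn

num_taxels, window = 24, 64                    # assumed sensor layout / window
tcn = nn.Sequential(
    nn.Conv1d(num_taxels, 32, kernel_size=3, padding=1, dilation=1), nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=3, padding=4, dilation=4), nn.ReLU(),
    nn.AdaptiveMaxPool1d(1), nn.Flatten(),
    nn.Linear(32, 2),                          # logits for {no slip, slip}
)

pressure_window = torch.randn(16, num_taxels, window)  # (batch, taxels, time)
logits = tcn(pressure_window)
print(logits.shape)  # torch.Size([16, 2])
```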
arXiv Detail & Related papers (2022-02-19T08:21:56Z)
- OmniTact: A Multi-Directional High Resolution Touch Sensor [109.28703530853542]
Existing tactile sensors are either flat, have small sensitive fields, or provide only low-resolution signals.
We introduce OmniTact, a multi-directional high-resolution tactile sensor.
We evaluate the capabilities of OmniTact on a challenging robotic control task.
arXiv Detail & Related papers (2020-03-16T01:31:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.