Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation
- URL: http://arxiv.org/abs/2512.23864v1
- Date: Mon, 29 Dec 2025 21:06:33 GMT
- Title: Learning to Feel the Future: DreamTacVLA for Contact-Rich Manipulation
- Authors: Guo Ye, Zexi Zhang, Xu Zhao, Shang Wu, Haoran Lu, Shihan Lu, Han Liu,
- Abstract summary: We introduce DreamTacVLA, a framework that grounds VLA models in contact physics by learning to feel the future. Our model adopts a hierarchical perception scheme in which high-resolution tactile images serve as micro-vision inputs. To further deepen the model's understanding of fine-grained contact dynamics, we finetune the system with a tactile world model that predicts future tactile signals.
- Score: 14.221542785249524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language-Action (VLA) models have shown remarkable generalization by mapping web-scale knowledge to robotic control, yet they remain blind to physical contact. Consequently, they struggle with contact-rich manipulation tasks that require reasoning about force, texture, and slip. While some approaches incorporate low-dimensional tactile signals, they fail to capture the high-resolution dynamics essential for such interactions. To address this limitation, we introduce DreamTacVLA, a framework that grounds VLA models in contact physics by learning to feel the future. Our model adopts a hierarchical perception scheme in which high-resolution tactile images serve as micro-vision inputs coupled with wrist-camera local vision and third-person macro vision. To reconcile these multi-scale sensory streams, we first train a unified policy with a Hierarchical Spatial Alignment (HSA) loss that aligns tactile tokens with their spatial counterparts in the wrist and third-person views. To further deepen the model's understanding of fine-grained contact dynamics, we finetune the system with a tactile world model that predicts future tactile signals. To mitigate tactile data scarcity and the wear-prone nature of tactile sensors, we construct a hybrid large-scale dataset sourced from both a high-fidelity digital twin and real-world experiments. By anticipating upcoming tactile states, DreamTacVLA acquires a rich model of contact physics and conditions its actions on both real observations and imagined consequences. Across contact-rich manipulation tasks, it outperforms state-of-the-art VLA baselines, achieving up to 95% success and highlighting the importance of understanding physical contact for robust, touch-aware robotic agents.
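The abstract describes the Hierarchical Spatial Alignment (HSA) loss and the tactile world model only at a high level; the snippet below is a minimal PyTorch sketch of how such components could be wired, not the paper's implementation. The class names, dimensions, spatial-correspondence indices, GRU dynamics, and cosine-distance loss form are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalSpatialAlignment(nn.Module):
    """Toy HSA-style loss: pull each tactile token toward the wrist-view and
    third-person-view tokens it spatially corresponds to (correspondence
    indices are assumed known, e.g. from sensor/camera geometry)."""

    def __init__(self, tactile_dim: int, vision_dim: int, shared_dim: int = 256):
        super().__init__()
        self.proj_tactile = nn.Linear(tactile_dim, shared_dim)
        self.proj_vision = nn.Linear(vision_dim, shared_dim)

    def forward(self, tactile_tokens, wrist_tokens, third_tokens, wrist_idx, third_idx):
        # tactile_tokens: (B, Nt, Dt); wrist/third_tokens: (B, Nv, Dv)
        # wrist_idx / third_idx: (B, Nt) index of the matching visual token.
        t = F.normalize(self.proj_tactile(tactile_tokens), dim=-1)
        w = F.normalize(self.proj_vision(wrist_tokens), dim=-1)
        v = F.normalize(self.proj_vision(third_tokens), dim=-1)

        def gather(x, idx):
            return torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))

        # 1 - cosine similarity to the spatially matched token, at both visual scales.
        loss_wrist = (1.0 - (t * gather(w, wrist_idx)).sum(-1)).mean()
        loss_third = (1.0 - (t * gather(v, third_idx)).sum(-1)).mean()
        return loss_wrist + loss_third


class TactileWorldModel(nn.Module):
    """Toy tactile world model: roll a pooled tactile embedding forward under a
    short action sequence, producing 'imagined' future tactile states that a
    policy could condition on alongside the real observations."""

    def __init__(self, token_dim: int = 256, action_dim: int = 7, horizon: int = 4):
        super().__init__()
        self.horizon = horizon
        self.dynamics = nn.GRU(token_dim + action_dim, token_dim, batch_first=True)

    def forward(self, tactile_tokens, actions):
        # tactile_tokens: (B, Nt, D); actions: (B, horizon, A)
        state = tactile_tokens.mean(dim=1, keepdim=True)              # (B, 1, D)
        inp = torch.cat([state.expand(-1, self.horizon, -1), actions], dim=-1)
        imagined, _ = self.dynamics(inp)                              # (B, horizon, D)
        return imagined
```

In this reading, the alignment term ties each tactile token to a spatially matched token at both visual scales, while the world model supplies predicted future contact states so actions can be conditioned on both observed and imagined tactile signals, mirroring the abstract's description at a schematic level.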
Related papers
- AnyTouch 2: General Optical Tactile Representation Learning For Dynamic Tactile Perception [25.926187751777903]
We present ToucHD, a large-scale hierarchical tactile dataset spanning tactile atomic actions, real-world manipulations, and touch-force paired data. We propose AnyTouch 2, a general tactile representation learning framework for diverse optical tactile sensors that unifies object-level understanding with fine-grained, force-aware dynamic perception.
arXiv Detail & Related papers (2026-02-10T10:05:53Z)
- OPENTOUCH: Bringing Full-Hand Touch to Real-World Interaction [93.88239833545623]
We present OpenTouch, the first in-the-wild egocentric full-hand tactile dataset. We show that tactile signals provide a compact yet powerful cue for grasp understanding. We aim to advance multimodal egocentric perception, embodied learning, and contact-rich robotic manipulation.
arXiv Detail & Related papers (2025-12-18T18:18:17Z)
- VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback [21.08021535027628]
We present VLA-Touch, an approach that enhances generalist robot policies with tactile sensing. Our method introduces two key innovations: (1) a pipeline that leverages a pretrained tactile-language model to provide semantic tactile feedback for high-level task planning, and (2) a diffusion-based controller that refines VLA-generated actions with tactile signals for contact-rich manipulation.
arXiv Detail & Related papers (2025-07-23T07:54:10Z)
- Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization [14.189391793395384]
This paper introduces Tactile-VLA, a novel framework that deeply fuses vision, language, action, and tactile sensing. Experiments demonstrate Tactile-VLA's effectiveness and generalizability in three key aspects.
arXiv Detail & Related papers (2025-07-12T06:44:37Z)
- ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation [62.58034332427291]
ForceVLA is a novel end-to-end manipulation framework. It treats external force sensing as a first-class modality within VLA systems.
arXiv Detail & Related papers (2025-05-28T09:24:25Z)
- Digitizing Touch with an Artificial Multimodal Fingertip [51.7029315337739]
Humans and robots both benefit from using touch to perceive and interact with the surrounding environment.
Here, we describe several conceptual and technological innovations to improve the digitization of touch.
These advances are embodied in an artificial finger-shaped sensor with advanced sensing capabilities.
arXiv Detail & Related papers (2024-11-04T18:38:50Z)
- Controllable Visual-Tactile Synthesis [28.03469909285511]
We develop a conditional generative model that synthesizes both visual and tactile outputs from a single sketch.
We then introduce a pipeline to render high-quality visual and tactile outputs on an electroadhesion-based haptic device.
arXiv Detail & Related papers (2023-05-04T17:59:51Z)
- Dynamic Modeling of Hand-Object Interactions via Tactile Sensing [133.52375730875696]
In this work, we employ a high-resolution tactile glove to perform four different interactive activities on a diversified set of objects.
We build our model on a cross-modal learning framework and generate the labels using a visual processing pipeline to supervise the tactile model.
This work takes a step on dynamics modeling in hand-object interactions from dense tactile sensing.
arXiv Detail & Related papers (2021-09-09T16:04:14Z)
- Elastic Tactile Simulation Towards Tactile-Visual Perception [58.44106915440858]
We propose Elastic Interaction of Particles (EIP) for tactile simulation.
EIP models the tactile sensor as a group of coordinated particles, and the elastic property is applied to regulate the deformation of particles during contact.
We further propose a tactile-visual perception network that enables information fusion between tactile data and visual images.
arXiv Detail & Related papers (2021-08-11T03:49:59Z)
- Learning Intuitive Physics with Multimodal Generative Models [24.342994226226786]
This paper presents a perception framework that fuses visual and tactile feedback to make predictions about the expected motion of objects in dynamic scenes.
We use a novel See-Through-your-Skin (STS) sensor that provides high resolution multimodal sensing of contact surfaces.
We validate through simulated and real-world experiments in which the resting state of an object is predicted from given initial conditions.
arXiv Detail & Related papers (2021-01-12T12:55:53Z)
- COCOI: Contact-aware Online Context Inference for Generalizable Non-planar Pushing [87.7257446869134]
General contact-rich manipulation problems are long-standing challenges in robotics.
Deep reinforcement learning has shown great potential in solving robot manipulation tasks.
We propose COCOI, a deep RL method that encodes a context embedding of dynamics properties online.
arXiv Detail & Related papers (2020-11-23T08:20:21Z)