Universal Visuo-Tactile Video Understanding for Embodied Interaction
- URL: http://arxiv.org/abs/2505.22566v1
- Date: Wed, 28 May 2025 16:43:01 GMT
- Title: Universal Visuo-Tactile Video Understanding for Embodied Interaction
- Authors: Yifan Xie, Mingyang Li, Shoujie Li, Xingting Li, Guangyu Chen, Fei Ma, Fei Richard Yu, Wenbo Ding
- Abstract summary: We present VTV-LLM, the first multi-modal large language model for universal Visuo-Tactile Video understanding. VTV-LLM bridges the gap between tactile perception and natural language. We develop a novel three-stage training paradigm that includes VTV enhancement for robust visuo-tactile representation.
- Score: 16.587054862266168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tactile perception is essential for embodied agents to understand physical attributes of objects that cannot be determined through visual inspection alone. While existing approaches have made progress in visual and language modalities for physical understanding, they fail to effectively incorporate tactile information that provides crucial haptic feedback for real-world interaction. In this paper, we present VTV-LLM, the first multi-modal large language model for universal Visuo-Tactile Video (VTV) understanding that bridges the gap between tactile perception and natural language. To address the challenges of cross-sensor and cross-modal integration, we contribute VTV150K, a comprehensive dataset comprising 150,000 video frames from 100 diverse objects captured across three different tactile sensors (GelSight Mini, DIGIT, and Tac3D), annotated with four fundamental tactile attributes (hardness, protrusion, elasticity, and friction). We develop a novel three-stage training paradigm that includes VTV enhancement for robust visuo-tactile representation, VTV-text alignment for cross-modal correspondence, and text prompt finetuning for natural language generation. Our framework enables sophisticated tactile reasoning capabilities including feature assessment, comparative analysis, scenario-based decision making and so on. Experimental evaluations demonstrate that VTV-LLM achieves superior performance in tactile video understanding tasks, establishing a foundation for more intuitive human-machine interaction in tactile domains.
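The abstract describes VTV150K (video frames from 100 objects captured by three tactile sensors and labeled with four attributes) together with a three-stage training paradigm. The sketch below shows one plausible way such a sample record and training schedule could be organized; all names (VTVSample, run_three_stage_training, the model methods) are hypothetical assumptions for illustration, not the authors' released code or API.

```python
# Illustrative sketch only: a VTV150K-style sample record and the three-stage
# training schedule outlined in the abstract. Every name here is hypothetical;
# the abstract does not expose a public API.
from dataclasses import dataclass
from enum import Enum
from typing import List


class Sensor(Enum):
    GELSIGHT_MINI = "GelSight Mini"
    DIGIT = "DIGIT"
    TAC3D = "Tac3D"


@dataclass
class VTVSample:
    """One visuo-tactile video clip annotated with the four tactile attributes."""
    frame_paths: List[str]   # tactile video frames for this clip
    sensor: Sensor           # which of the three sensors captured the clip
    object_id: int           # one of the 100 objects
    hardness: float          # annotated attribute values (scale assumed)
    protrusion: float
    elasticity: float
    friction: float


def run_three_stage_training(model, dataset: List[VTVSample]) -> None:
    """Outline of the three-stage paradigm named in the abstract (hypothetical calls)."""
    # Stage 1: VTV enhancement -- learn a robust visuo-tactile representation.
    model.train_vtv_encoder(dataset)
    # Stage 2: VTV-text alignment -- establish cross-modal correspondence.
    model.align_vtv_with_text(dataset)
    # Stage 3: text prompt finetuning -- adapt for natural language generation.
    model.finetune_on_text_prompts(dataset)
```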
Related papers
- ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models [50.42183477287337]
Video understanding plays a vital role in bridging low-level visual signals with high-level cognitive reasoning. We introduce a novel video reasoning paradigm: Video-Text Interleaved CoT (ViTCoT). We show that ViTCoT significantly enhances performance compared to the traditional text-only CoT paradigm.
arXiv Detail & Related papers (2025-07-14T03:21:13Z)
- Tactile-VLA: Unlocking Vision-Language-Action Model's Physical Knowledge for Tactile Generalization [14.189391793395384]
This paper introduces Tactile-VLA, a novel framework that deeply fuses vision, language, action, and tactile sensing. Experiments demonstrate Tactile-VLA's effectiveness and generalizability in three key aspects.
arXiv Detail & Related papers (2025-07-12T06:44:37Z)
- Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset [113.25650486482762]
We introduce the Seamless Interaction dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage. This dataset enables the development of AI technologies that understand dyadic embodied dynamics. We develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech.
arXiv Detail & Related papers (2025-06-27T18:09:49Z)
- Spatial Understanding from Videos: Structured Prompts Meet Simulation Data [79.52833996220059]
We present a unified framework for enhancing 3D spatial reasoning in pre-trained vision-language models without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes.
arXiv Detail & Related papers (2025-06-04T07:36:33Z)
- Tactile MNIST: Benchmarking Active Tactile Perception [19.93022179513013]
We introduce the Tactile MNIST Benchmark Suite, an open-source, Gymnasium-compatible benchmark for tactile perception tasks (a generic Gymnasium interaction sketch is given after this list). Our benchmark suite offers diverse simulation scenarios, from simple toy environments all the way to complex tactile perception tasks using vision-based tactile sensors. We also offer a comprehensive dataset comprising 13,500 synthetic 3D MNIST digit models and 153,600 real-world tactile samples collected from 600 3D printed digits.
arXiv Detail & Related papers (2025-06-03T14:42:16Z)
- AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors [11.506370451126378]
Visuo-tactile sensors aim to emulate human tactile perception, enabling robots to understand and manipulate objects. We introduce TacQuad, an aligned multi-modal tactile multi-sensor dataset from four different visuo-tactile sensors. We propose AnyTouch, a unified static-dynamic multi-sensor representation learning framework with a multi-level structure.
arXiv Detail & Related papers (2025-02-15T08:33:25Z)
- Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset [50.09271028495819]
Multimodal research related to touch focuses on visual and tactile modalities.
We construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration.
arXiv Detail & Related papers (2024-03-14T19:01:54Z)
- Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought [62.619076257298204]
We motivate framing video reasoning as the sequential understanding of a small number of video keyframes.
We introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought.
We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in complex video reasoning tasks, and encourage future work.
arXiv Detail & Related papers (2023-05-23T10:26:42Z)
- Dynamic Modeling of Hand-Object Interactions via Tactile Sensing [133.52375730875696]
In this work, we employ a high-resolution tactile glove to perform four different interactive activities on a diversified set of objects.
We build our model on a cross-modal learning framework and generate the labels using a visual processing pipeline to supervise the tactile model.
This work takes a step toward dynamics modeling of hand-object interactions from dense tactile sensing.
arXiv Detail & Related papers (2021-09-09T16:04:14Z)
- Elastic Tactile Simulation Towards Tactile-Visual Perception [58.44106915440858]
We propose Elastic Interaction of Particles (EIP) for tactile simulation.
EIP models the tactile sensor as a group of coordinated particles, and the elastic property is applied to regulate the deformation of particles during contact.
We further propose a tactile-visual perception network that enables information fusion between tactile data and visual images.
arXiv Detail & Related papers (2021-08-11T03:49:59Z)
- Learning Intuitive Physics with Multimodal Generative Models [24.342994226226786]
This paper presents a perception framework that fuses visual and tactile feedback to make predictions about the expected motion of objects in dynamic scenes.
We use a novel See-Through-your-Skin (STS) sensor that provides high resolution multimodal sensing of contact surfaces.
We validate the framework through simulated and real-world experiments in which the resting state of an object is predicted from given initial conditions.
arXiv Detail & Related papers (2021-01-12T12:55:53Z)
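Because the Tactile MNIST entry above advertises a Gymnasium-compatible benchmark, the sketch below shows the standard Gymnasium interaction loop such an environment would plug into. The environment ID string is a placeholder assumption; the benchmark's own documentation defines the actual registered names.

```python
# Generic Gymnasium interaction loop; the environment ID below is a placeholder,
# not necessarily one registered by the Tactile MNIST Benchmark Suite.
import gymnasium as gym

env = gym.make("TactilePerception-v0")   # hypothetical env ID
obs, info = env.reset(seed=0)

for _ in range(100):
    action = env.action_space.sample()   # random policy as a stand-in
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()

env.close()
```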
This list is automatically generated from the titles and abstracts of the papers on this site.