TLA: Tactile-Language-Action Model for Contact-Rich Manipulation
- URL: http://arxiv.org/abs/2503.08548v1
- Date: Tue, 11 Mar 2025 15:36:28 GMT
- Title: TLA: Tactile-Language-Action Model for Contact-Rich Manipulation
- Authors: Peng Hao, Chaofan Zhang, Dingzhe Li, Xiaoge Cao, Xiaoshuai Hao, Shaowei Cui, Shuo Wang
- Abstract summary: We introduce the Tactile-Language-Action model, which processes sequential tactile feedback via cross-modal language grounding. We construct a comprehensive dataset that contains 24k pairs of tactile action instruction data, customized for fingertip peg-in-hole assembly. Results show that TLA significantly outperforms traditional imitation learning methods in terms of effective action generation and action accuracy.
- Score: 9.97307182748107
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Significant progress has been made in vision-language models. However, language-conditioned robotic manipulation for contact-rich tasks remains underexplored, particularly in terms of tactile sensing. To address this gap, we introduce the Tactile-Language-Action (TLA) model, which effectively processes sequential tactile feedback via cross-modal language grounding to enable robust policy generation in contact-intensive scenarios. In addition, we construct a comprehensive dataset that contains 24k pairs of tactile action instruction data, customized for fingertip peg-in-hole assembly, providing essential resources for TLA training and evaluation. Our results show that TLA significantly outperforms traditional imitation learning methods (e.g., diffusion policy) in terms of effective action generation and action accuracy, while demonstrating strong generalization capabilities by achieving over 85% success rate on previously unseen assembly clearances and peg shapes. We publicly release all data and code in the hope of advancing research in language-conditioned tactile manipulation skill learning. Project website: https://sites.google.com/view/tactile-language-action/
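To make the described pipeline concrete, here is a minimal PyTorch sketch of a tactile-language-action policy: a per-frame tactile encoder, a recurrent aggregator over the tactile sequence, and a head that fuses the result with a language embedding to emit an assembly action. Every module name, dimension, and the fusion scheme below is an illustrative assumption, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TLASketch(nn.Module):
    """Minimal sketch of a tactile-language-action policy.

    Assumptions (not from the paper): tactile frames are 16x16 taxel
    images, language arrives as pre-computed sentence embeddings, and
    the action is a 6-DoF end-effector delta. The real TLA model grounds
    tactile sequences in language via a pretrained language model; this
    stand-in fuses the two modalities by simple concatenation.
    """

    def __init__(self, lang_dim: int = 512, action_dim: int = 6):
        super().__init__()
        # Per-frame tactile encoder: one taxel image -> feature vector.
        self.tactile_encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # GRU aggregates the tactile sequence into one state vector.
        self.temporal = nn.GRU(64, 128, batch_first=True)
        # Fused tactile + language features -> continuous action.
        self.policy_head = nn.Sequential(
            nn.Linear(128 + lang_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, tactile_seq: torch.Tensor, lang_emb: torch.Tensor) -> torch.Tensor:
        b, t, h, w = tactile_seq.shape
        frames = self.tactile_encoder(tactile_seq.reshape(b * t, 1, h, w))
        _, state = self.temporal(frames.reshape(b, t, -1))
        fused = torch.cat([state[-1], lang_emb], dim=-1)
        return self.policy_head(fused)

# Toy usage: batch of 2, sequences of 10 tactile frames, 512-d language embeddings.
policy = TLASketch()
action = policy(torch.randn(2, 10, 16, 16), torch.randn(2, 512))
print(action.shape)  # torch.Size([2, 6])
```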
Related papers
- Improving Pinterest Search Relevance Using Large Language Models [15.24121687428178]
We integrate Large Language Models (LLMs) into our search relevance model.
Our approach uses search queries alongside content representations that include captions extracted from a generative visual language model.
We distill the LLM-based model into real-time servable model architectures and features.
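The distillation step can be sketched as fitting a lightweight servable model to offline LLM relevance scores. The feature dimension, MSE objective, and tensors below are assumed stand-ins, not Pinterest's production pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in: the LLM-based relevance model has scored
# (query, pin) pairs offline; a small servable MLP is fit to those scores.
student = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

features = torch.randn(1024, 256)      # cheap features available at serving time
teacher_scores = torch.rand(1024, 1)   # offline LLM relevance predictions

for _ in range(100):                   # simple full-batch distillation loop
    optimizer.zero_grad()
    loss = F.mse_loss(student(features), teacher_scores)
    loss.backward()
    optimizer.step()
```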
arXiv Detail & Related papers (2024-10-22T16:29:33Z)
- Deep Neural Network-Based Sign Language Recognition: A Comprehensive Approach Using Transfer Learning with Explainability [0.0]
We suggest a novel solution that uses a deep neural network to fully automate sign language recognition.
The approach integrates sophisticated preprocessing techniques to optimise overall performance.
Our model's ability to provide informational clarity was assessed using the SHAP (SHapley Additive exPlanations) method.
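SHAP's public API makes this kind of assessment straightforward to reproduce. The tiny classifier and random tensors below are placeholders for the paper's actual network and data.

```python
import torch
import torch.nn as nn
import shap  # pip install shap

# Placeholder classifier standing in for the paper's sign-language network;
# inputs are flattened 28x28 images, outputs are 24 static-gesture classes.
model = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 24))
model.eval()

background = torch.randn(50, 28 * 28)   # baseline samples for the explainer
test_inputs = torch.randn(3, 28 * 28)   # inputs whose predictions we explain

explainer = shap.DeepExplainer(model, background)  # Shapley-value approximation for deep nets
shap_values = explainer.shap_values(test_inputs)   # per-class feature attributions
```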
arXiv Detail & Related papers (2024-09-11T17:17:44Z)
- KIF: Knowledge Identification and Fusion for Language Model Continual Learning [41.28933724210434]
We introduce a novel framework for language models, named Knowledge Identification and Fusion (KIF). KIF segregates the model into 'skill units' based on parameter dependencies, allowing for more precise control. It employs a novel group-wise knowledge identification technique to ascertain the importance distribution of skill units for a new task. As a result, KIF achieves an optimal balance between retaining prior knowledge and excelling in new tasks.
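The summary does not say how importance is computed. A common proxy, shown purely as an illustration, is accumulated squared gradients per parameter group, with each parameter tensor standing in crudely for a KIF skill unit.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative proxy for group-wise knowledge identification: score each
# parameter tensor by its accumulated squared gradient on a batch from
# the new task. (KIF's actual grouping and scoring may differ.)
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))
x, y = torch.randn(16, 32), torch.randint(0, 8, (16,))

F.cross_entropy(model(x), y).backward()

importance = {
    name: p.grad.pow(2).sum().item()
    for name, p in model.named_parameters() if p.grad is not None
}
# High-importance units would be tuned for the new task; low-importance
# units could be frozen or fused to retain prior knowledge.
print(sorted(importance, key=importance.get, reverse=True))
```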
arXiv Detail & Related papers (2024-08-09T17:44:45Z)
- CLEFT: Language-Image Contrastive Learning with Efficient Large Language Model and Prompt Fine-Tuning [4.004641316826348]
We introduce a novel language-image Contrastive Learning method with an Efficient large language model and prompt Fine-Tuning (CLEFT).
Our method demonstrates state-of-the-art performance on multiple chest X-ray and mammography datasets.
The proposed parameter-efficient framework can reduce the total trainable model size by 39% and reduce the trainable language model to only 4% compared with the current BERT encoder.
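The contrastive core is CLIP-style. The sketch below shows a generic symmetric InfoNCE loss over paired image/text embeddings; the batch size, embedding dimension, and temperature are assumptions, not CLEFT's settings.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                  # pairwise similarities
    targets = torch.arange(len(img), device=img.device)   # matched pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random 512-d embeddings for 8 image/report pairs.
loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```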
arXiv Detail & Related papers (2024-07-30T17:57:32Z)
- Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset [50.09271028495819]
Most existing multimodal research related to touch focuses on visual and tactile modalities.
We construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration.
arXiv Detail & Related papers (2024-03-14T19:01:54Z)
- Cross-modality Data Augmentation for End-to-End Sign Language Translation [66.46877279084083]
End-to-end sign language translation (SLT) aims to convert sign language videos into spoken language texts directly without intermediate representations.
It has been a challenging task due to the modality gap between sign videos and texts and the scarcity of labeled data.
We propose a novel Cross-modality Data Augmentation (XmDA) framework to transfer the powerful gloss-to-text translation capabilities to end-to-end sign language translation.
arXiv Detail & Related papers (2023-05-18T16:34:18Z)
- Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models [70.82705830137708]
We introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL).
We utilize semi-supervised language labels, leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data.
DIAL enables imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.
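The relabeling idea can be sketched with OpenAI's public CLIP package: score candidate instructions against a frame from an unlabeled demonstration and keep the best match as a pseudo-label. The candidate strings and dummy frame are invented; this is not the DIAL codebase.

```python
import torch
import clip  # OpenAI's CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Dummy demonstration frame; in practice this is a frame from a robot video.
frame = preprocess(Image.new("RGB", (224, 224))).unsqueeze(0).to(device)
candidates = ["pick up the red block", "open the drawer", "push the cup left"]
tokens = clip.tokenize(candidates).to(device)

with torch.no_grad():
    image_f = model.encode_image(frame)
    text_f = model.encode_text(tokens)
    image_f = image_f / image_f.norm(dim=-1, keepdim=True)
    text_f = text_f / text_f.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_f @ text_f.t()).softmax(dim=-1)

# Attach the best-scoring instruction to the demonstration as a new label.
pseudo_label = candidates[probs.argmax().item()]
```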
arXiv Detail & Related papers (2022-11-21T18:56:00Z)
- Improving Policy Learning via Language Dynamics Distillation [87.27583619910338]
We propose Language Dynamics Distillation (LDD), which pretrains a model to predict environment dynamics given demonstrations with language descriptions.
We show that language descriptions in demonstrations improve sample efficiency and generalization across environments.
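As an illustration only, the pretraining objective can be reduced to next-observation prediction conditioned on an encoded language description; every shape and module below is an assumption, not the LDD architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dynamics-prediction pretraining in the spirit of LDD:
# given the current observation and an encoded language description of
# the demonstration, predict the next observation. (Shapes are invented.)
obs_dim, lang_dim = 64, 128
dynamics = nn.Sequential(
    nn.Linear(obs_dim + lang_dim, 256), nn.ReLU(),
    nn.Linear(256, obs_dim),
)
optimizer = torch.optim.Adam(dynamics.parameters(), lr=3e-4)

obs, next_obs = torch.randn(32, obs_dim), torch.randn(32, obs_dim)
lang = torch.randn(32, lang_dim)  # encoded language descriptions

pred = dynamics(torch.cat([obs, lang], dim=-1))
loss = F.mse_loss(pred, next_obs)
loss.backward()
optimizer.step()
# The pretrained weights would then initialize the downstream policy.
```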
arXiv Detail & Related papers (2022-09-30T19:56:04Z)
- Generative Action Description Prompts for Skeleton-based Action Recognition [15.38417530693649]
We propose a Generative Action-description Prompts (GAP) approach for skeleton-based action recognition.
We employ a pre-trained large-scale language model as the knowledge engine to automatically generate text descriptions for body-part movements of actions.
Our proposed GAP method achieves noticeable improvements over various baseline models without extra cost at inference.
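The knowledge-engine step can be sketched with Hugging Face's text-generation pipeline. Here gpt2 is only a lightweight stand-in for the large-scale model GAP actually uses, and the prompt wording is invented.

```python
from transformers import pipeline  # pip install transformers

# Illustrative "knowledge engine" in the spirit of GAP: prompt a
# pretrained LM for a per-body-part movement description of an action.
generator = pipeline("text-generation", model="gpt2")

action, part = "drinking water", "right arm"
prompt = f"Describe the {part} movement when a person is {action}:"
out = generator(prompt, max_new_tokens=30, do_sample=False)
description = out[0]["generated_text"]
# Generated descriptions would then supervise the skeleton model's
# text-alignment objective during training.
```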
arXiv Detail & Related papers (2022-08-10T12:55:56Z)
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
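A generic soft-label distillation step conveys the teacher-to-student transfer; VidLanKD's actual objectives are richer, and all shapes below are invented.

```python
import torch
import torch.nn.functional as F

# Generic soft-label distillation: the student matches the (frozen)
# video-language teacher's token distributions on shared text data.
T = 2.0                                          # distillation temperature
teacher_logits = torch.randn(32, 30522)          # teacher predictions (frozen)
student_logits = torch.randn(32, 30522, requires_grad=True)

loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)
loss.backward()
```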
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.