Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset
- URL: http://arxiv.org/abs/2403.09813v3
- Date: Mon, 17 Jun 2024 13:19:26 GMT
- Title: Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset
- Authors: Ning Cheng, You Li, Jing Gao, Bin Fang, Jinan Xu, Wenjuan Han
- Abstract summary: Multimodal research related to touch primarily focuses on the visual and tactile modalities, with limited exploration of language.
We construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration.
- Score: 50.09271028495819
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tactility provides crucial support and enhancement for the perception and interaction capabilities of both humans and robots. Nevertheless, the multimodal research related to touch primarily focuses on visual and tactile modalities, with limited exploration in the domain of language. Beyond vocabulary, sentence-level descriptions contain richer semantics. Based on this, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration, featuring sentence-level descriptions for multimodal alignment. The new dataset is used to fine-tune our proposed lightweight training framework, STLV-Align (Synergistic Touch-Language-Vision Alignment), achieving effective semantic alignment with minimal parameter adjustments (1%). Project Page: https://xiaoen0.github.io/touch.page/.
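The abstract's key technical claim is that semantic alignment across touch, language, and vision is reached by adjusting only about 1% of parameters. The paper's STLV-Align code is not reproduced here, so the following is only a minimal sketch of that general recipe, assuming frozen backbone encoders, small trainable projection heads, and a contrastive (InfoNCE) objective; all module names and dimensions are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of parameter-efficient tri-modal alignment (not the authors' STLV-Align code).
# Frozen backbones are stood in for by small linear layers; only the lightweight
# projection heads are trained, keeping the trainable parameter fraction small.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Small trainable head mapping a frozen encoder's features into a shared space."""
    def __init__(self, in_dim: int, out_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

def info_nce(a, b, temperature: float = 0.07):
    """Symmetric InfoNCE loss for two batches of paired, normalized embeddings."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Frozen stand-in backbones (in practice: pretrained tactile, vision, and text encoders).
touch_backbone = nn.Linear(256, 512).requires_grad_(False)
vision_backbone = nn.Linear(768, 512).requires_grad_(False)
text_backbone = nn.Linear(384, 512).requires_grad_(False)

# Only these small heads are optimized.
touch_head, vision_head, text_head = ProjectionHead(512), ProjectionHead(512), ProjectionHead(512)
optimizer = torch.optim.AdamW(
    list(touch_head.parameters()) + list(vision_head.parameters()) + list(text_head.parameters()),
    lr=1e-4,
)

# One toy training step on random stand-in features for a batch of 8 aligned triplets.
touch, image, text = torch.randn(8, 256), torch.randn(8, 768), torch.randn(8, 384)
t = touch_head(touch_backbone(touch))
v = vision_head(vision_backbone(image))
s = text_head(text_backbone(text))
loss = info_nce(t, s) + info_nce(t, v)  # align touch with language and with vision
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In a setup like this, the trainable portion is only the projection heads, which is how a figure on the order of 1% of total parameters becomes plausible once the frozen backbones are large.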
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Fuse & Calibrate: A bi-directional Vision-Language Guided Framework for Referring Image Segmentation [8.383431263616105]
We introduce FCNet, a framework that employs a bi-directional guided fusion approach where both vision and language play guiding roles.
Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key vision information.
We then propose a language-guided calibration module to further calibrate these multi-modal features, ensuring they understand the context of the input sentence.
arXiv Detail & Related papers (2024-05-18T07:21:12Z)
- A Touch, Vision, and Language Dataset for Multimodal Alignment [30.616909132040764]
This work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%).
We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language model for text generation using the trained encoder (see the sketch below this entry).
Results suggest that by incorporating touch, the TVL model improves touch-vision-language alignment (+29% classification accuracy) over existing models trained on any pair of those modalities.
arXiv Detail & Related papers (2024-02-20T18:47:56Z)
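The open-vocabulary classification mentioned in the TVL entry above is the CLIP-style procedure of comparing a tactile embedding against text-prompt embeddings in a shared space. The sketch below illustrates only that general procedure; the encoders, prompt token ids, and dimensions are hypothetical stand-ins rather than the TVL authors' models.

```python
# Illustrative sketch (not the TVL authors' code) of open-vocabulary classification
# with a tactile encoder aligned to a text encoder in a shared embedding space.
# Both encoders here are random stand-ins for the trained/pretrained models.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 128
tactile_encoder = nn.Linear(256, embed_dim)       # stand-in for the trained tactile encoder
text_encoder = nn.EmbeddingBag(1000, embed_dim)   # stand-in text encoder over token ids

def embed_texts(token_id_batches):
    """Encode a list of tokenized class prompts into normalized embeddings."""
    return torch.stack([F.normalize(text_encoder(ids.unsqueeze(0)).squeeze(0), dim=-1)
                        for ids in token_id_batches])

def classify_touch(touch_signal, class_text_embeds):
    """Pick the class whose prompt embedding is most similar to the touch embedding."""
    touch_embed = F.normalize(tactile_encoder(touch_signal), dim=-1)
    similarities = touch_embed @ class_text_embeds.t()
    return similarities.argmax(dim=-1)

# Example: three hypothetical class prompts (e.g. "smooth glass", "rough fabric",
# "soft sponge"), represented here by arbitrary token ids.
prompts = [torch.tensor([1, 2]), torch.tensor([3, 4]), torch.tensor([5, 6])]
class_embeds = embed_texts(prompts)
touch_batch = torch.randn(4, 256)                  # stand-in tactile features
print(classify_touch(touch_batch, class_embeds))   # predicted class index per sample
```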
- ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning [24.87615615489849]
We present precise referring instructions that utilize diverse reference representations, such as points and boxes, as referring prompts to refer to specific regions.
We propose ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity including mouse clicks, drag-and-drop, and drawing boxes.
arXiv Detail & Related papers (2023-07-18T17:56:06Z)
- LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation [94.33019040320507]
Multimodal Machine Translation (MMT) focuses on enhancing text-only translation with visual features.
Recent approaches still train a separate model for each language pair, which is costly and unaffordable as the number of languages increases.
We propose the Multilingual MMT task by establishing two new Multilingual MMT benchmark datasets covering seven languages.
arXiv Detail & Related papers (2022-10-19T12:21:39Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark [31.91964553419665]
We present a new multilingual dataset, called MTOP, comprising 100k annotated utterances in 6 languages across 11 domains.
We achieve an average improvement of +6.3 points on Slot F1 for the two existing multilingual datasets, over the best results reported in their experiments.
We demonstrate strong zero-shot performance using pre-trained models combined with automatic translation and alignment, and a proposed distant supervision method to reduce the noise in slot label projection.
arXiv Detail & Related papers (2020-08-21T07:02:11Z)