Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset
- URL: http://arxiv.org/abs/2403.09813v3
- Date: Mon, 17 Jun 2024 13:19:26 GMT
- Title: Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset
- Authors: Ning Cheng, You Li, Jing Gao, Bin Fang, Jinan Xu, Wenjuan Han
- Abstract summary: Multimodal research related to touch primarily focuses on the visual and tactile modalities, with limited exploration of language.
We construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration.
- Score: 50.09271028495819
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tactility provides crucial support and enhancement for the perception and interaction capabilities of both humans and robots. Nevertheless, the multimodal research related to touch primarily focuses on visual and tactile modalities, with limited exploration in the domain of language. Beyond vocabulary, sentence-level descriptions contain richer semantics. Based on this, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration, featuring sentence-level descriptions for multimodal alignment. The new dataset is used to fine-tune our proposed lightweight training framework, STLV-Align (Synergistic Touch-Language-Vision Alignment), achieving effective semantic alignment with minimal parameter adjustments (1%). Project Page: https://xiaoen0.github.io/touch.page/.
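The abstract's key technical claim is that semantic alignment across touch, language, and vision is reached by adjusting only about 1% of parameters. The paper's STLV-Align code is not reproduced here, so the following is only a minimal sketch of that general recipe, assuming frozen backbone encoders, small trainable projection heads, and a contrastive (InfoNCE) objective; all module names and dimensions are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of parameter-efficient tri-modal alignment (not the authors' STLV-Align code).
# Frozen backbones are stood in for by small linear layers; only the lightweight
# projection heads are trained, keeping the trainable parameter fraction small.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Small trainable head mapping a frozen encoder's features into a shared space."""
    def __init__(self, in_dim: int, out_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)

def info_nce(a, b, temperature: float = 0.07):
    """Symmetric InfoNCE loss for two batches of paired, normalized embeddings."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Frozen stand-in backbones (in practice: pretrained tactile, vision, and text encoders).
touch_backbone = nn.Linear(256, 512).requires_grad_(False)
vision_backbone = nn.Linear(768, 512).requires_grad_(False)
text_backbone = nn.Linear(384, 512).requires_grad_(False)

# Only these small heads are optimized.
touch_head, vision_head, text_head = ProjectionHead(512), ProjectionHead(512), ProjectionHead(512)
optimizer = torch.optim.AdamW(
    list(touch_head.parameters()) + list(vision_head.parameters()) + list(text_head.parameters()),
    lr=1e-4,
)

# One toy training step on random stand-in features for a batch of 8 aligned triplets.
touch, image, text = torch.randn(8, 256), torch.randn(8, 768), torch.randn(8, 384)
t = touch_head(touch_backbone(touch))
v = vision_head(vision_backbone(image))
s = text_head(text_backbone(text))
loss = info_nce(t, s) + info_nce(t, v)  # align touch with language and with vision
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In a setup like this, the trainable portion is only the projection heads, which is how a figure on the order of 1% of total parameters becomes plausible once the frozen backbones are large.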
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Fuse & Calibrate: A bi-directional Vision-Language Guided Framework for Referring Image Segmentation [8.383431263616105]
We introduce FCNet, a framework that employs a bi-directional guided fusion approach where both vision and language play guiding roles.
Specifically, we use a vision-guided approach to conduct initial multi-modal fusion, obtaining multi-modal features that focus on key vision information.
We then propose a language-guided calibration module to further calibrate these multi-modal features, ensuring they understand the context of the input sentence.
arXiv Detail & Related papers (2024-05-18T07:21:12Z)
- A Touch, Vision, and Language Dataset for Multimodal Alignment [30.616909132040764]
This work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%).
We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language model for text generation using the trained encoder (see the sketch below this entry).
Results suggest that by incorporating touch, the TVL model improves touch-vision-language alignment (+29% classification accuracy) over existing models trained on any pair of those modalities.
arXiv Detail & Related papers (2024-02-20T18:47:56Z)
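The open-vocabulary classification mentioned in the TVL entry above is the CLIP-style procedure of comparing a tactile embedding against text-prompt embeddings in a shared space. The sketch below illustrates only that general procedure; the encoders, prompt token ids, and dimensions are hypothetical stand-ins rather than the TVL authors' models.

```python
# Illustrative sketch (not the TVL authors' code) of open-vocabulary classification
# with a tactile encoder aligned to a text encoder in a shared embedding space.
# Both encoders here are random stand-ins for the trained/pretrained models.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 128
tactile_encoder = nn.Linear(256, embed_dim)       # stand-in for the trained tactile encoder
text_encoder = nn.EmbeddingBag(1000, embed_dim)   # stand-in text encoder over token ids

def embed_texts(token_id_batches):
    """Encode a list of tokenized class prompts into normalized embeddings."""
    return torch.stack([F.normalize(text_encoder(ids.unsqueeze(0)).squeeze(0), dim=-1)
                        for ids in token_id_batches])

def classify_touch(touch_signal, class_text_embeds):
    """Pick the class whose prompt embedding is most similar to the touch embedding."""
    touch_embed = F.normalize(tactile_encoder(touch_signal), dim=-1)
    similarities = touch_embed @ class_text_embeds.t()
    return similarities.argmax(dim=-1)

# Example: three hypothetical class prompts (e.g. "smooth glass", "rough fabric",
# "soft sponge"), represented here by arbitrary token ids.
prompts = [torch.tensor([1, 2]), torch.tensor([3, 4]), torch.tensor([5, 6])]
class_embeds = embed_texts(prompts)
touch_batch = torch.randn(4, 256)                  # stand-in tactile features
print(classify_touch(touch_batch, class_embeds))   # predicted class index per sample
```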
- ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning [24.87615615489849]
We present precise referring instructions that utilize diverse reference representations, such as points and boxes, as referring prompts to refer to specific regions.
We propose ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity including mouse clicks, drag-and-drop, and drawing boxes.
arXiv Detail & Related papers (2023-07-18T17:56:06Z)
- LVP-M3: Language-aware Visual Prompt for Multilingual Multimodal Machine Translation [94.33019040320507]
Multimodal Machine Translation (MMT) focuses on enhancing text-only translation with visual features.
Recent approaches still train a separate model for each language pair, which is costly and unaffordable as the number of languages increases.
We propose the Multilingual MMT task by establishing two new Multilingual MMT benchmark datasets covering seven languages.
arXiv Detail & Related papers (2022-10-19T12:21:39Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark [31.91964553419665]
We present a new multilingual dataset, called MTOP, comprising 100k annotated utterances in 6 languages across 11 domains.
We achieve an average improvement of +6.3 points on Slot F1 for the two existing multilingual datasets, over the best results reported in their experiments.
We demonstrate strong zero-shot performance using pre-trained models combined with automatic translation and alignment, and a proposed distant supervision method to reduce the noise in slot label projection.
arXiv Detail & Related papers (2020-08-21T07:02:11Z)