A Touch, Vision, and Language Dataset for Multimodal Alignment
- URL: http://arxiv.org/abs/2402.13232v1
- Date: Tue, 20 Feb 2024 18:47:56 GMT
- Title: A Touch, Vision, and Language Dataset for Multimodal Alignment
- Authors: Letian Fu and Gaurav Datta and Huang Huang and William Chung-Ho
Panitch and Jaimyn Drake and Joseph Ortiz and Mustafa Mukadam and Mike
Lambeta and Roberto Calandra and Ken Goldberg
- Abstract summary: This work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%).
We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language model for text generation using the trained encoder.
Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) touch-vision-language alignment over existing models trained on any pair of those modalities.
- Score: 30.616909132040764
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Touch is an important sensing modality for humans, but it has not yet been
incorporated into a multimodal generative language model. This is partially due
to the difficulty of obtaining natural language labels for tactile data and the
complexity of aligning tactile readings with both visual observations and
language descriptions. As a step towards bridging that gap, this work
introduces a new dataset of 44K in-the-wild vision-touch pairs, with English
language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V
(90%). We use this dataset to train a vision-language-aligned tactile encoder
for open-vocabulary classification and a touch-vision-language (TVL) model for
text generation using the trained encoder. Results suggest that by
incorporating touch, the TVL model improves (+29% classification accuracy)
touch-vision-language alignment over existing models trained on any pair of
those modalities. Although only a small fraction of the dataset is
human-labeled, the TVL model demonstrates improved visual-tactile understanding
over GPT-4V (+12%) and open-source vision-language models (+32%) on a new
touch-vision understanding benchmark. Code and data:
https://tactile-vlm.github.io.
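The open-vocabulary classification use of the tactile encoder follows the CLIP-style recipe: encode the tactile reading into the shared embedding space and score it against text embeddings of candidate class prompts. The sketch below is a minimal illustration and not the authors' released code; `tactile_encoder`, `text_encoder`, and the prompt template are hypothetical placeholders for an encoder already aligned to a vision-language space (e.g., a frozen CLIP text tower).

```python
import torch
import torch.nn.functional as F

def zero_shot_tactile_classify(tactile_encoder, text_encoder, tactile_image, class_names):
    """CLIP-style zero-shot classification of a tactile reading.

    tactile_encoder: hypothetical module mapping a (C, H, W) tactile image to a D-dim
                     embedding already aligned to a vision-language space.
    text_encoder:    hypothetical callable mapping a list of strings to (N, D) embeddings,
                     e.g. a frozen CLIP text tower.
    """
    # Illustrative prompt template; the wording used in the paper may differ.
    prompts = [f"This surface feels {name}." for name in class_names]
    with torch.no_grad():
        touch_emb = F.normalize(tactile_encoder(tactile_image.unsqueeze(0)), dim=-1)  # (1, D)
        text_emb = F.normalize(text_encoder(prompts), dim=-1)                         # (N, D)
    sims = (touch_emb @ text_emb.T).squeeze(0)   # cosine similarities, shape (N,)
    probs = sims.softmax(dim=-1)
    return class_names[int(probs.argmax())], probs
```

Because the tactile encoder shares the vision-language embedding space, new classes can be added simply by writing new prompts, without retraining the encoder.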
Related papers
- TextToucher: Fine-Grained Text-to-Touch Generation [20.49021594738016]
We analyze the characteristics of tactile images in detail at two granularities: object-level (tactile texture, tactile shape) and sensor-level (gel status).
We propose a fine-grained Text-to-Touch generation method (TextToucher) to generate high-quality tactile samples.
arXiv Detail & Related papers (2024-09-09T08:26:47Z)
- Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset [50.09271028495819]
Existing multimodal research related to touch focuses on visual and tactile modalities.
We construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration.
arXiv Detail & Related papers (2024-03-14T19:01:54Z)
- Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition [92.6211155264297]
Vision models have gained increasing attention due to their simplicity and efficiency in the Scene Text Recognition (STR) task.
Recent vision models suffer from two problems: (1) the pure vision-based query results in attention drift, which usually causes poor recognition and is summarized as the linguistic insensitive drift (LID) problem in this paper.
We propose a Linguistic Perception Vision model (LPV), which explores the linguistic capability of the vision model for accurate text recognition.
arXiv Detail & Related papers (2023-05-09T02:52:47Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Robotic Skill Acquisition via Instruction Augmentation with Vision-Language Models [70.82705830137708]
We introduce Data-driven Instruction Augmentation for Language-conditioned control (DIAL).
We utilize semi-language labels, leveraging the semantic understanding of CLIP to propagate knowledge onto large datasets of unlabelled demonstration data.
DIAL enables imitation learning policies to acquire new capabilities and generalize to 60 novel instructions unseen in the original dataset.
arXiv Detail & Related papers (2022-11-21T18:56:00Z)
- Tactile-ViewGCN: Learning Shape Descriptor from Tactile Data using Graph Convolutional Network [0.4189643331553922]
This work focuses on improving previous approaches to object classification using tactile data.
We propose a novel method, dubbed Tactile-ViewGCN, that hierarchically aggregates tactile features.
Our model outperforms previous methods on the STAG dataset with an accuracy of 81.82%.
arXiv Detail & Related papers (2022-03-12T05:58:21Z)
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
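Several of the entries above, like the TVL tactile encoder itself, rely on contrastive alignment between paired embeddings from different modalities (touch/image, token/image, video/text). The following is a generic sketch of that recipe, assuming a symmetric InfoNCE objective with a fixed temperature; it is not the implementation of any specific paper listed here.

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    emb_a, emb_b: (B, D) embeddings of paired samples from two modalities
    (e.g. tactile/image, token/image, or video/text); row i of each is a pair.
    """
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # positives lie on the diagonal
    loss_a = F.cross_entropy(logits, targets)           # match each a_i to its b_i
    loss_b = F.cross_entropy(logits.T, targets)         # and each b_i to its a_i
    return 0.5 * (loss_a + loss_b)
```

Training a new-modality encoder against an already-aligned partner, as the abstract describes for the tactile encoder, pulls paired samples together in the shared space, which is what makes the zero-shot scoring in the earlier sketch possible.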
This list is automatically generated from the titles and abstracts of the papers on this site.