Leverage Points in Modality Shifts: Comparing Language-only and
Multimodal Word Representations
- URL: http://arxiv.org/abs/2306.02348v1
- Date: Sun, 4 Jun 2023 12:53:12 GMT
- Title: Leverage Points in Modality Shifts: Comparing Language-only and
Multimodal Word Representations
- Authors: Aleksey Tikhonov, Lisa Bylinina, Denis Paperno
- Abstract summary: Multimodal embeddings aim to enrich the semantic information in neural representations of language compared to text-only models.
Our paper compares word embeddings from three vision-and-language models and three text-only models, with static and contextual representations.
This is the first large-scale study of the effect of visual grounding on language representations, including 46 semantic parameters.
- Score: 0.8594140167290097
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal embeddings aim to enrich the semantic information in neural
representations of language compared to text-only models. While different
embeddings exhibit different applicability and performance on downstream tasks,
little is known about the systematic representation differences attributed to
the visual modality. Our paper compares word embeddings from three
vision-and-language models (CLIP, OpenCLIP and Multilingual CLIP) and three
text-only models, with static (FastText) as well as contextual representations
(multilingual BERT; XLM-RoBERTa). This is the first large-scale study of the
effect of visual grounding on language representations, including 46 semantic
parameters. We identify meaning properties and relations that characterize
words whose embeddings are most affected by the inclusion of visual modality in
the training data; that is, points where visual grounding turns out most
important. We find that the effect of visual modality correlates most with
denotational semantic properties related to concreteness, but is also detected
for several specific semantic classes, as well as for valence, a
sentiment-related connotational property of linguistic expressions.
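To make the comparison concrete, below is a minimal sketch (not the authors' exact pipeline) of how one might embed a word list with a vision-and-language text encoder (CLIP) and a text-only model (XLM-RoBERTa) using Hugging Face Transformers, and quantify, per word, how much its similarity neighbourhood shifts between the two spaces. The toy word list, the mean pooling, the specific checkpoints, and the 1-minus-correlation shift score are illustrative assumptions, not the paper's method.

```python
# Minimal sketch (not the authors' exact pipeline): embed a small word list
# with a vision-and-language text encoder (CLIP) and a text-only model
# (XLM-RoBERTa), then measure how much each word's similarity neighbourhood
# shifts between the two spaces. Word list, pooling, checkpoints, and the
# shift score are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer, CLIPTextModel, CLIPTokenizer

words = ["dog", "bicycle", "apple", "justice", "freedom"]  # toy word list

# CLIP text encoder: trained with a visual grounding signal.
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_txt = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

# XLM-RoBERTa: trained on text only.
xlmr_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
xlmr = AutoModel.from_pretrained("xlm-roberta-base")

@torch.no_grad()
def embed(word_list, tokenizer, model):
    """Encode each word in isolation; mean-pool the last hidden state."""
    batch = tokenizer(word_list, return_tensors="pt", padding=True)
    hidden = model(**batch).last_hidden_state            # (n_words, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # ignore padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (n_words, dim)

def cosine_matrix(emb):
    emb = torch.nn.functional.normalize(emb, dim=-1)
    return emb @ emb.T                                     # (n_words, n_words)

sim_multimodal = cosine_matrix(embed(words, clip_tok, clip_txt))
sim_text_only = cosine_matrix(embed(words, xlmr_tok, xlmr))

# Per-word "modality shift": 1 minus the correlation between the word's
# similarity profile in the multimodal space and in the text-only space.
for i, word in enumerate(words):
    profiles = torch.stack([sim_multimodal[i], sim_text_only[i]])
    shift = 1 - torch.corrcoef(profiles)[0, 1]
    print(f"{word:10s} shift ~= {shift.item():.3f}")
```

Comparing similarity matrices rather than raw vectors sidesteps the dimensionality mismatch between the two encoders; a fuller analysis in the spirit of the abstract would relate such per-word shift scores to semantic annotations such as concreteness or valence.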
Related papers
- How do Multimodal Foundation Models Encode Text and Speech? An Analysis of Cross-Lingual and Cross-Modal Representations [17.528100902591056]
Cross-modal representations converge over model layers, except in the initial layers specialized in text and speech processing.
Speech exhibits larger cross-lingual differences than text.
For models not explicitly trained for modality-agnostic representations, the modality gap is more prominent than the language gap.
arXiv Detail & Related papers (2024-11-26T18:29:11Z)
- Expand BERT Representation with Visual Information via Grounded Language Learning with Multimodal Partial Alignment [11.148099070407431]
GroundedBERT is a grounded language learning method that enhances the BERT representation with visually grounded information.
Our proposed method significantly outperforms the baseline language models on various language tasks of the GLUE and SQuAD datasets.
arXiv Detail & Related papers (2023-12-04T03:16:48Z)
- What Do Self-Supervised Speech Models Know About Words? [23.163029143563893]
Self-supervised speech models (S3Ms) have been introduced over the last few years, improving performance and data efficiency on various speech tasks.
Recent work has begun analyzing how S3Ms encode certain properties, such as phonetic and speaker information.
We use lightweight analysis methods to study segment-level linguistic properties encoded in S3Ms.
arXiv Detail & Related papers (2023-06-30T22:36:41Z)
- Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We revisit contrastive learning-based vision-language pre-training approaches such as CLIP, representing both images and texts with finite discrete tokens.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z)
- VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [113.50220968583353]
We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
arXiv Detail & Related papers (2022-03-20T03:49:02Z)
- Efficient Multi-Modal Embeddings from Structured Data [0.0]
Multi-modal word semantics aims to enhance embeddings with perceptual input.
Visual grounding can contribute to linguistic applications as well.
The new embeddings convey information complementary to text-based embeddings.
arXiv Detail & Related papers (2021-10-06T08:42:09Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)
- Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method that explicitly enhances conventional word embeddings with multiple-aspect senses derived from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)