Learning Zero-Shot Multifaceted Visually Grounded Word Embeddings via Multi-Task Training
- URL: http://arxiv.org/abs/2104.07500v1
- Date: Thu, 15 Apr 2021 14:49:11 GMT
- Title: Learning Zero-Shot Multifaceted Visually Grounded Word Embeddings via Multi-Task Training
- Authors: Hassan Shahmohammadi, Hendrik P. A. Lensch, R. Harald Baayen
- Abstract summary: Language grounding aims at linking the symbolic representation of language (e.g., words) to the rich perceptual knowledge of the outside world.
We argue that this approach sacrifices the abstract knowledge obtained from linguistic co-occurrence statistics in the process of acquiring perceptual information.
- Score: 8.271859911016719
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language grounding aims at linking the symbolic representation of language
(e.g., words) to the rich perceptual knowledge of the outside world. The
general approach is to embed both textual and visual information into a common
space, the grounded space, confined by an explicit relationship between both
modalities. We argue that this approach sacrifices the abstract knowledge
obtained from linguistic co-occurrence statistics in the process of acquiring
perceptual information. The focus of this paper is to solve this issue by
implicitly grounding the word embeddings. Rather than learning two mappings
into a joint space, our approach integrates modalities by determining a
reversible grounded mapping between the textual and the grounded space by means
of multi-task learning. Evaluations on intrinsic and extrinsic tasks show that
our embeddings are highly beneficial for both abstract and concrete words. They
are strongly correlated with human judgments and outperform previous works on a
wide range of benchmarks. Our grounded embeddings are publicly available here.
Related papers
- Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition [96.62264528407863]
We propose a self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency.
Inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling.
Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance with a notable margin.
arXiv Detail & Related papers (2024-06-15T04:50:19Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Language with Vision: a Study on Grounded Word and Sentence Embeddings [6.231247903840833]
Grounding language in vision is an active field of research seeking to construct cognitively plausible word and sentence representations.
The present study proposes a computational grounding model for pre-trained word embeddings.
Our model effectively balances the interplay between language and vision by aligning textual embeddings with visual information.
arXiv Detail & Related papers (2022-06-17T15:04:05Z)
- MCSE: Multimodal Contrastive Learning of Sentence Embeddings [23.630041603311923]
We propose a sentence embedding learning approach that exploits both visual and textual information via a multimodal contrastive objective.
We show that our approach consistently improves the performance across various datasets and pre-trained encoders.
arXiv Detail & Related papers (2022-04-22T21:19:24Z)
- Aspectuality Across Genre: A Distributional Semantics Approach [25.816944882581343]
The interpretation of the lexical aspect of verbs in English plays a crucial role for recognizing textual entailment and learning discourse-level inferences.
We show that two elementary dimensions of aspectual class, states vs. events, and telic vs. atelic events, can be modelled effectively with distributional semantics.
arXiv Detail & Related papers (2020-10-31T19:37:22Z)
- Improving Machine Reading Comprehension with Contextualized Commonsense Knowledge [62.46091695615262]
We aim to extract commonsense knowledge to improve machine reading comprehension.
We propose to represent relations implicitly by situating structured knowledge in a context.
We employ a teacher-student paradigm to inject multiple types of contextualized knowledge into a student machine reader.
arXiv Detail & Related papers (2020-09-12T17:20:01Z)
- Understanding Spatial Relations through Multiple Modalities [78.07328342973611]
Spatial relations between objects can either be explicit, expressed as spatial prepositions, or implicit, expressed by spatial verbs such as moving, walking, or shifting.
We introduce the task of inferring implicit and explicit spatial relations between two entities in an image.
We design a model that uses both textual and visual information to predict the spatial relations, making use of both positional and size information of objects and image embeddings.
arXiv Detail & Related papers (2020-07-19T01:35:08Z)
- On Vocabulary Reliance in Scene Text Recognition [79.21737876442253]
Scene text recognition methods perform well on images whose words are within the training vocabulary but generalize poorly to images with words outside it.
We call this phenomenon "vocabulary reliance".
We propose a simple yet effective mutual learning strategy to allow models of two families to learn collaboratively.
arXiv Detail & Related papers (2020-05-08T11:16:58Z)
- On the Language Neutrality of Pre-trained Multilingual Representations [70.93503607755055]
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)
- Incorporating Visual Semantics into Sentence Representations within a Grounded Space [20.784771968813747]
We propose to transfer visual information to textual representations by learning an intermediate representation space: the grounded space.
We show that this model outperforms the previous state-of-the-art on classification and semantic relatedness tasks.
arXiv Detail & Related papers (2020-02-07T12:26:41Z)
- A Common Semantic Space for Monolingual and Cross-Lingual Meta-Embeddings [10.871587311621974]
This paper presents a new technique for creating monolingual and cross-lingual meta-embeddings.
Existing word vectors are projected to a common semantic space using linear transformations and averaging (a loose sketch of this projection-and-averaging step follows after this list).
The resulting cross-lingual meta-embeddings also exhibit excellent cross-lingual transfer learning capabilities.
arXiv Detail & Related papers (2020-01-17T15:42:29Z)
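The meta-embedding entry above states only the general recipe: project existing word vectors into a common space with linear transformations, then average them. The sketch below is a loose illustration under the assumption of a simple least-squares projection; the function names (fit_projection, meta_embed) and the choice of target space are made up for the example and are not taken from the paper.

```python
# Hypothetical sketch (not the paper's procedure): build meta-embeddings by
# projecting word vectors from one embedding space into another with a
# least-squares linear map fitted on the shared vocabulary, then averaging
# the two representations of each word in the common space.
import numpy as np

def fit_projection(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    # Least-squares linear map W with src @ W ~= tgt.
    W, *_ = np.linalg.lstsq(src, tgt, rcond=None)
    return W

def meta_embed(emb_a: dict, emb_b: dict) -> dict:
    shared = sorted(set(emb_a) & set(emb_b))
    A = np.stack([emb_a[w] for w in shared])
    B = np.stack([emb_b[w] for w in shared])
    W_b = fit_projection(B, A)  # treat space A as the common semantic space
    return {w: (emb_a[w] + emb_b[w] @ W_b) / 2.0 for w in shared}

# Toy usage with 3-dimensional embeddings.
emb_a = {"cat": np.array([0.1, 0.2, 0.3]), "dog": np.array([0.2, 0.1, 0.4])}
emb_b = {"cat": np.array([0.3, 0.1, 0.2]), "dog": np.array([0.4, 0.2, 0.1])}
print(meta_embed(emb_a, emb_b)["cat"])
```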
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.