Accurate Word Representations with Universal Visual Guidance
- URL: http://arxiv.org/abs/2012.15086v1
- Date: Wed, 30 Dec 2020 09:11:50 GMT
- Title: Accurate Word Representations with Universal Visual Guidance
- Authors: Zhuosheng Zhang, Haojie Yu, Hai Zhao, Rui Wang, Masao Utiyama
- Abstract summary: This paper proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
- Score: 55.71425503859685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Word representation is a fundamental component in neural language
understanding models. Recently, pre-trained language models (PrLMs) have offered
a performant way of building contextualized word representations by leveraging
sequence-level context for modeling. Although PrLMs generally give more
accurate contextualized word representations than non-contextualized models do,
they are still confined to the textual context alone, without diverse multimodal
hints for word representation. This paper thus proposes a visual
representation method to explicitly enhance conventional word embedding with
multiple-aspect senses from visual guidance. In detail, we build a small-scale
word-image dictionary from a multimodal seed dataset where each word
corresponds to diverse related images. The texts and paired images are encoded
in parallel, followed by an attention layer to integrate the multimodal
representations. We show that the method substantially improves the accuracy of
disambiguation. Experiments on 12 natural language understanding and machine
translation tasks further verify the effectiveness and the generalization
capability of the proposed approach.
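The fusion step can be pictured with a short sketch. The following is a minimal, hypothetical PyTorch rendering of the idea (not the authors' released code): each word attends over the visual features of its dictionary images, and the attention-pooled visual vector is gated into the word embedding. The tensor shapes, the scaled dot-product scoring, and the gating scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn


class VisualWordFusion(nn.Module):
    """Fuse each word embedding with an attention-pooled summary of its images."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # word embedding -> attention query
        self.key = nn.Linear(dim, dim)     # image features -> attention keys
        self.value = nn.Linear(dim, dim)   # image features -> attention values
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, words: torch.Tensor, images: torch.Tensor) -> torch.Tensor:
        # words:  [batch, seq_len, dim]     word embeddings from a text encoder
        # images: [batch, seq_len, k, dim]  features of k dictionary images per word
        q = self.query(words).unsqueeze(2)                 # [B, T, 1, D]
        k, v = self.key(images), self.value(images)        # [B, T, K, D]
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5       # [B, T, K]
        attn = scores.softmax(dim=-1).unsqueeze(-1)        # [B, T, K, 1]
        visual = (attn * v).sum(dim=2)                     # [B, T, D] pooled visual sense
        return torch.tanh(self.gate(torch.cat([words, visual], dim=-1)))


# Toy usage with random features standing in for real text and image encoders.
fusion = VisualWordFusion(dim=64)
out = fusion(torch.randn(2, 10, 64), torch.randn(2, 10, 4, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```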
Related papers
A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues [23.743431157431893]
Conditional inference on joint textual and visual clues is a multi-modal reasoning task.
We propose a Multi-modal Context Reasoning approach, named ModCR.
We conduct extensive experiments on two corresponding datasets, and the results show significantly improved performance.
arXiv Detail & Related papers (2023-05-08T08:05:40Z)
Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and a convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
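As a rough illustration of the retrieve-then-encode pipeline summarized above, the sketch below fakes the topic-image lookup table with random tensors, uses a small Transformer encoder for the text, and stands in a tiny CNN for a real image backbone; the table contents, image sizes, and feature dimensions are all assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical topic-image lookup table: topic word -> image tensors (C, H, W).
lookup = {
    "dog": [torch.randn(3, 64, 64)],
    "beach": [torch.randn(3, 64, 64), torch.randn(3, 64, 64)],
}


def retrieve_images(sentence: str, table: dict, max_images: int = 5):
    """Collect up to max_images images whose topic word occurs in the sentence."""
    hits = [img for w in sentence.lower().split() for img in table.get(w, [])]
    return hits[:max_images]


sentence = "a dog runs on the beach"
images = torch.stack(retrieve_images(sentence, lookup))     # [num_images, 3, 64, 64]

# Text encoder: a small Transformer over (here random) token embeddings.
text_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True), num_layers=2
)
text_repr = text_encoder(torch.randn(1, len(sentence.split()), 128))   # [1, T, 128]

# Image encoder: a tiny CNN standing in for e.g. a ResNet backbone.
cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 128),
)
image_repr = cnn(images)                                    # [num_images, 128]
print(text_repr.shape, image_repr.shape)
```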
MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning [23.45678557013005]
We propose a jointly masked multimodal modeling method to learn fine-grained multimodal representations.
Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover.
Our model achieves state-of-the-art performance on various downstream vision-language tasks, including image-text retrieval, visual question answering, visual reasoning, and weakly-supervised visual grounding.
arXiv Detail & Related papers (2022-10-09T06:31:15Z)
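A minimal sketch of the joint-masking idea described above (not MAMO's actual implementation): a fraction of the text token embeddings and of the image patch embeddings is replaced by learned mask embeddings, and a model would then be trained to recover the masked signals. The mask ratio and all shapes are assumptions, and the implicit/explicit recovery targets are only indicated in comments.

```python
import torch
import torch.nn as nn

dim, mask_ratio = 64, 0.25
text_mask_emb = nn.Parameter(torch.zeros(dim))    # learned [MASK] embedding for tokens
patch_mask_emb = nn.Parameter(torch.zeros(dim))   # learned mask embedding for patches


def joint_mask(text: torch.Tensor, patches: torch.Tensor):
    """Randomly mask a fraction of text tokens and of image patches."""
    t_mask = torch.rand(text.shape[:2]) < mask_ratio       # [B, T] booleans
    p_mask = torch.rand(patches.shape[:2]) < mask_ratio    # [B, P] booleans
    masked_text = torch.where(t_mask.unsqueeze(-1), text_mask_emb, text)
    masked_patches = torch.where(p_mask.unsqueeze(-1), patch_mask_emb, patches)
    return masked_text, masked_patches, t_mask, p_mask


text = torch.randn(2, 12, dim)      # token embeddings of the caption
patches = torch.randn(2, 49, dim)   # image patch embeddings (e.g. a 7x7 grid)
m_text, m_patches, t_mask, p_mask = joint_mask(text, patches)

# A model would now encode (m_text, m_patches) jointly and be trained to recover
# the masked positions: explicit targets could be the original inputs, implicit
# targets the predictions of a momentum/teacher encoder (both omitted here).
print(m_text.shape, m_patches.shape, int(t_mask.sum()), int(p_mask.sum()))
```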
Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
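The sketch below is one plausible reading of visually-augmented language modeling, not VaLM itself: retrieved image features (random here) are projected into the token-embedding space and prepended as visual tokens before a causal Transformer predicts the next word. The retrieval step, feature sizes, and model depth are assumptions.

```python
import torch
import torch.nn as nn

vocab, dim, n_visual = 1000, 128, 4

token_emb = nn.Embedding(vocab, dim)
visual_proj = nn.Linear(2048, dim)    # project e.g. CNN features into embedding space
lm_head = nn.Linear(dim, vocab)
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)

tokens = torch.randint(0, vocab, (1, 10))        # text prefix
image_feats = torch.randn(1, n_visual, 2048)     # stand-in for retrieved image features

# Prepend projected visual tokens to the text embeddings.
x = torch.cat([visual_proj(image_feats), token_emb(tokens)], dim=1)   # [1, 14, dim]

# Causal mask so every position only attends to itself and earlier positions.
seq_len = x.size(1)
causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

hidden = backbone(x, mask=causal)
next_word_logits = lm_head(hidden[:, -1])        # predict the next token of the text
print(next_word_logits.shape)                    # torch.Size([1, 1000])
```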
Modeling Semantic Composition with Syntactic Hypergraph for Video Question Answering [14.033438649614219]
A key challenge in video question answering is realizing the cross-modal semantic alignment between textual concepts and their corresponding visual objects.
We propose to first build a syntactic dependency tree for each question with an off-the-shelf tool and use it to extract meaningful word compositions.
Based on the extracted compositions, a hypergraph is further built by viewing the words as nodes and the compositions as hyperedges.
arXiv Detail & Related papers (2022-05-13T09:28:13Z)
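A toy sketch of the words-as-nodes, compositions-as-hyperedges construction described above: the dependency parse is hard-coded here, whereas the paper would obtain it from an off-the-shelf parser, and each head word together with its dependents is treated as one hyperedge.

```python
from collections import defaultdict

question = ["what", "is", "the", "man", "holding"]
# Toy (dependent_index, head_index) pairs; -1 marks the root ("holding").
# A real system would obtain these from an off-the-shelf dependency parser.
dependencies = [(0, 4), (1, 4), (2, 3), (3, 4), (4, -1)]

children = defaultdict(list)
for dep, head in dependencies:
    if head >= 0:
        children[head].append(dep)

# Nodes are words; each head together with its dependents forms one hyperedge.
hyperedges = [sorted([head] + deps) for head, deps in children.items()]

print("nodes:", question)
for edge in hyperedges:
    print("hyperedge:", [question[i] for i in edge])
# prints hyperedges such as ['what', 'is', 'man', 'holding'] and ['the', 'man']
```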
Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
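One common way to realize the weakly supervised multi-instance learning mentioned above is to treat caption-derived labels as bag-level supervision and max-pool label-region similarities over image regions; the sketch below shows that generic pattern with random features and should not be read as the paper's exact formulation.

```python
import torch
import torch.nn as nn

n_regions, n_labels, dim = 36, 5, 256
region_feats = torch.randn(1, n_regions, dim)    # features of detected image regions
label_emb = nn.Embedding(n_labels, dim)          # embeddings of caption-derived labels
labels = torch.arange(n_labels).unsqueeze(0)     # labels mentioned in the caption (the "bag")

# Score every label against every region, then max-pool over regions: the bag is
# positive if at least one region matches, so no region-level labels are needed.
sim = torch.einsum("bld,brd->blr", label_emb(labels), region_feats)   # [1, L, R]
bag_scores, best_region = sim.max(dim=-1)                             # [1, L] each

loss = nn.functional.binary_cross_entropy_with_logits(
    bag_scores, torch.ones_like(bag_scores)   # every caption label is a positive bag
)
print(bag_scores.shape, best_region.shape, float(loss))
```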
Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
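The probing setup can be pictured as a small classifier over frozen features, as in the hedged sketch below: a text representation and a visual patch representation are concatenated, and the probe predicts whether they refer to the same object category. The feature dimensions and the binary-classification framing are assumptions standing in for the paper's exact protocol.

```python
import torch
import torch.nn as nn

text_dim, vis_dim = 768, 512

probe = nn.Sequential(
    nn.Linear(text_dim + vis_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 1),        # logit: matching vs. non-matching pair
)

text_repr = torch.randn(8, text_dim)             # frozen language-model features
matching_patch = torch.randn(8, vis_dim)         # patches of the mentioned object
random_patch = torch.randn(8, vis_dim)           # patches of unrelated objects

pos = probe(torch.cat([text_repr, matching_patch], dim=-1))
neg = probe(torch.cat([text_repr, random_patch], dim=-1))
labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
loss = nn.functional.binary_cross_entropy_with_logits(torch.cat([pos, neg]), labels)
print(float(loss))
```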
BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
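As a rough picture of the twin-structure idea, the sketch below runs two inputs of different granularity (a phrase and a sentence) through one shared Transformer encoder with mean pooling to get fixed-size vectors; the real BURT uses BERT and trains on NLI and paraphrase pairs, which are omitted here.

```python
import torch
import torch.nn as nn


class TwinEncoder(nn.Module):
    """One shared encoder applied to both sides of a pair (the Siamese idea)."""

    def __init__(self, dim: int = 128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [batch, length, dim]; mean pooling gives a fixed-size vector
        return self.encoder(tokens).mean(dim=1)


enc = TwinEncoder()
phrase = torch.randn(1, 3, 128)      # a word/phrase-level input
sentence = torch.randn(1, 20, 128)   # a sentence-level input
sim = nn.functional.cosine_similarity(enc(phrase), enc(sentence))
print(sim.shape)  # torch.Size([1]) -- one similarity score per pair
```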