Evaluating Multimodal Representations on Visual Semantic Textual
Similarity
- URL: http://arxiv.org/abs/2004.01894v1
- Date: Sat, 4 Apr 2020 09:03:04 GMT
- Title: Evaluating Multimodal Representations on Visual Semantic Textual
Similarity
- Authors: Oier Lopez de Lacalle, Ander Salaberria, Aitor Soroa, Gorka Azkune and
Eneko Agirre
- Abstract summary: We present a novel task, Visual Semantic Textual Similarity (vSTS), where such inference ability can be tested directly.
Our experiments using simple multimodal representations show that the addition of image representations produces better inference, compared to text-only representations.
Our work shows, for the first time, the successful contribution of visual information to textual inference, with ample room for more complex multimodal representation options.
- Score: 22.835699807110018
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The combination of visual and textual representations has produced excellent
results in tasks such as image captioning and visual question answering, but
the inference capabilities of multimodal representations are largely untested.
In the case of textual representations, inference tasks such as Textual
Entailment and Semantic Textual Similarity have often been used to benchmark
the quality of textual representations. The long-term goal of our research is
to devise multimodal representation techniques that improve current inference
capabilities. We thus present a novel task, Visual Semantic Textual Similarity
(vSTS), where such inference ability can be tested directly. Given two items,
each comprising an image and its accompanying caption, vSTS systems need to
assess the degree to which the captions in context are semantically equivalent
to each other. Our experiments using simple multimodal representations show
that the addition of image representations produces better inference, compared
to text-only representations. The improvement is observed both when directly
computing the similarity between the representations of the two items, and when
learning a siamese network based on vSTS training data. Our work shows, for the
first time, the successful contribution of visual information to textual
inference, with ample room for benchmarking more complex multimodal
representation options.
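As a concrete illustration of the two settings described in the abstract, here is a minimal sketch in PyTorch. It is not the authors' code: it assumes pre-computed image features (e.g., from a CNN) and caption features (e.g., from a sentence encoder), and the names (`item_representation`, `SiameseScorer`) and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def item_representation(img_feats: torch.Tensor,
                        txt_feats: torch.Tensor) -> torch.Tensor:
    # Each vSTS item is an image plus its caption; a simple multimodal
    # representation is the concatenation of the two feature vectors.
    return torch.cat([img_feats, txt_feats], dim=-1)

def direct_similarity(item_a: torch.Tensor,
                      item_b: torch.Tensor) -> torch.Tensor:
    # Setting 1: score the pair directly with cosine similarity,
    # using no vSTS training data.
    return F.cosine_similarity(item_a, item_b, dim=-1)

class SiameseScorer(nn.Module):
    # Setting 2: a siamese network trained on vSTS similarity labels
    # (hypothetical architecture, for illustration only).
    def __init__(self, in_dim: int, hidden: int = 256):
        super().__init__()
        # One shared projection is applied to both items (the siamese part).
        self.proj = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, item_a: torch.Tensor,
                item_b: torch.Tensor) -> torch.Tensor:
        ha, hb = self.proj(item_a), self.proj(item_b)
        # Element-wise product and absolute difference are a common way
        # to combine the two sides of a siamese sentence model.
        pair = torch.cat([ha * hb, (ha - hb).abs()], dim=-1)
        return self.score(pair).squeeze(-1)  # regressed similarity score
```

In the unsupervised setting only `direct_similarity` is used; the siamese scorer would be trained to regress the gold similarity labels from the vSTS training split.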
Related papers
- Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model [25.47573567479831]
We propose a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques.
Our method is out-of-the-box and does not require fine-tuning or optimization.
arXiv Detail & Related papers (2024-05-16T17:59:21Z)
- CoPL: Contextual Prompt Learning for Vision-Language Understanding [21.709017504227823]
We propose a Contextual Prompt Learning (CoPL) framework, capable of aligning the prompts to the localized features of the image.
Our key innovations over earlier works include using local image features as part of the prompt learning process, and more crucially, learning to weight these prompts based on local features that are appropriate for the task at hand.
Our method produces substantially improved performance when compared to the current state of the art methods.
arXiv Detail & Related papers (2023-07-03T10:14:33Z)
- Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training [33.78990448307792]
Image-text retrieval is a central problem for understanding the semantic relationship between vision and language.
Previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words.
In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework.
arXiv Detail & Related papers (2023-06-15T00:19:13Z)
- Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [63.84720380390935]
There exist two typical types of medical vision-and-language pre-training models, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
arXiv Detail & Related papers (2023-02-17T15:43:42Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)