Learning to Represent Image and Text with Denotation Graph
- URL: http://arxiv.org/abs/2010.02949v1
- Date: Tue, 6 Oct 2020 18:00:58 GMT
- Title: Learning to Represent Image and Text with Denotation Graph
- Authors: Bowen Zhang, Hexiang Hu, Vihan Jain, Eugene Ie, Fei Sha
- Abstract summary: We propose learning representations from a set of implied, visually grounded expressions between image and text.
We show that state-of-the-art multimodal learning models can be further improved by leveraging automatically harvested structural relations.
- Score: 32.417311523031195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning to fuse vision and language information and to represent them is an important research problem with many applications. Recent progress has leveraged the ideas of pre-training (from language modeling) and attention layers in Transformers to learn representations from datasets containing images aligned with linguistic expressions that describe the images. In this paper, we propose learning representations from a set of implied, visually grounded expressions between image and text, automatically mined from those datasets. In particular, we use denotation graphs to represent how specific concepts (such as sentences describing images) can be linked to abstract and generic concepts (such as short phrases) that are also visually grounded. Such generic-to-specific relations can be discovered using linguistic analysis tools. We propose methods to incorporate these relations into representation learning. We show that state-of-the-art multimodal learning models can be further improved by leveraging automatically harvested structural relations. The representations lead to stronger empirical results on the downstream tasks of cross-modal image retrieval, referring expression, and compositional attribute-object recognition. Both our code and the extracted denotation graphs for the Flickr30K and COCO datasets are publicly available at https://sha-lab.github.io/DG.
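To make the generic-to-specific idea concrete, the sketch below is a hypothetical illustration (not the paper's pipeline): it uses spaCy's noun chunker to abstract each caption into shorter, still visually grounded phrases and records a generic-phrase-to-caption edge for each. The actual denotation graph construction applies a richer set of linguistic rules (e.g., hypernym substitution), so this only conveys the basic idea.

```python
# Minimal, illustrative sketch of harvesting generic-to-specific edges from
# captions with an off-the-shelf parser (spaCy). Assumes the small English
# model "en_core_web_sm" is installed; names and rules here are illustrative.
import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")

def generic_phrases(caption: str):
    """Return shorter, more generic expressions implied by a caption."""
    doc = nlp(caption)
    phrases = set()
    for chunk in doc.noun_chunks:
        # keep the full noun phrase ("a young girl") and its bare head noun ("girl")
        phrases.add(chunk.text.lower())
        phrases.add(chunk.root.lemma_.lower())
    return phrases

def build_edges(captions):
    """Map each harvested generic phrase to the specific captions it abstracts."""
    graph = defaultdict(set)
    for caption in captions:
        for phrase in generic_phrases(caption):
            graph[phrase].add(caption)  # edge: generic phrase -> specific caption
    return graph

edges = build_edges(["A young girl is climbing a red rock wall.",
                     "Two dogs run across a grassy field."])
for phrase, caps in sorted(edges.items()):
    print(f"{phrase!r} -> {sorted(caps)}")
```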
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z) - Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for
Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z) - Visual Semantic Parsing: From Images to Abstract Meaning Representation [20.60579156219413]
We propose to leverage a widely-used meaning representation from the field of natural language processing, the Abstract Meaning Representation (AMR).
Our visual AMR graphs are more linguistically informed, with a focus on higher-level semantic concepts extrapolated from visual input.
Our findings point to important future research directions for improved scene understanding.
arXiv Detail & Related papers (2022-10-26T17:06:42Z) - Scaling Up Visual and Vision-Language Representation Learning With Noisy
Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss (a minimal sketch of such an objective appears after this list).
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z) - Probing Contextual Language Models for Common Ground with Visual
Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z) - Graph-Structured Referring Expression Reasoning in The Wild [105.95488002374158]
Grounding referring expressions aims to locate in an image an object referred to by a natural language expression.
We propose a scene graph guided modular network (SGMN) to perform reasoning over a semantic graph and a scene graph.
We also propose Ref-Reasoning, a large-scale real-world dataset for structured referring expression reasoning.
arXiv Detail & Related papers (2020-04-19T11:00:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.