Visually Grounded Compound PCFGs
- URL: http://arxiv.org/abs/2009.12404v1
- Date: Fri, 25 Sep 2020 19:07:00 GMT
- Title: Visually Grounded Compound PCFGs
- Authors: Yanpeng Zhao and Ivan Titov
- Abstract summary: Exploiting visual groundings for language understanding has recently been drawing much attention.
We study visually grounded grammar induction and learn a constituency parser from both unlabeled text and its visual groundings.
- Score: 65.04669567781634
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Exploiting visual groundings for language understanding has recently been
drawing much attention. In this work, we study visually grounded grammar
induction and learn a constituency parser from both unlabeled text and its
visual groundings. Existing work on this task (Shi et al., 2019) optimizes a
parser via Reinforce and derives the learning signal only from the alignment of
images and sentences. While their model is relatively accurate overall, its
error distribution is very uneven, with low performance on certain constituent
types (e.g., 26.2% recall on verb phrases, VPs) and high on others (e.g., 79.6%
recall on noun phrases, NPs). This is not surprising as the learning signal is
likely insufficient for deriving all aspects of phrase-structure syntax and
gradient estimates are noisy. We show that using an extension of the
probabilistic context-free grammar model we can do fully-differentiable end-to-end visually
grounded learning. Additionally, this enables us to complement the image-text
alignment loss with a language modeling objective. On the MSCOCO test captions,
our model establishes a new state of the art, outperforming its non-grounded
version and, thus, confirming the effectiveness of visual groundings in
constituency grammar induction. It also substantially outperforms the previous
grounded model, with the largest improvements on more 'abstract' categories (e.g.,
+55.1% recall on VPs).
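As a rough illustration of the objective described in the abstract, the sketch below combines a language-modeling loss (e.g., the sentence negative log-likelihood produced by the grammar's inside algorithm) with a hinge-style image-text matching loss in which each span's term is weighted by a differentiable span marginal. This is a minimal sketch, not the authors' implementation: the margin, the weighting coefficient alpha, and all tensor names are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a joint objective: LM loss plus an
# image-text matching loss weighted by differentiable span marginals.
import torch
import torch.nn.functional as F

def matching_loss(span_emb, span_marginals, img_emb, neg_img_emb, margin=0.5):
    """Expected hinge loss over spans, weighted by differentiable span marginals."""
    pos = F.cosine_similarity(span_emb, img_emb.expand_as(span_emb), dim=-1)
    neg = F.cosine_similarity(span_emb, neg_img_emb.expand_as(span_emb), dim=-1)
    hinge = torch.clamp(margin + neg - pos, min=0.0)        # (num_spans,)
    return (span_marginals * hinge).sum()

def joint_loss(lm_nll, span_emb, span_marginals, img_emb, neg_img_emb, alpha=1.0):
    """Sentence NLL (e.g., from the grammar's inside algorithm) plus the
    visually grounded matching term; both terms are differentiable end to end."""
    return lm_nll + alpha * matching_loss(span_emb, span_marginals, img_emb, neg_img_emb)

# Toy usage with random tensors standing in for real encoders and the parser.
num_spans, dim = 6, 128
loss = joint_loss(
    lm_nll=torch.tensor(42.0, requires_grad=True),
    span_emb=torch.randn(num_spans, dim, requires_grad=True),
    span_marginals=torch.softmax(torch.randn(num_spans), dim=0),
    img_emb=torch.randn(1, dim),
    neg_img_emb=torch.randn(1, dim),
)
loss.backward()
```

Because the span marginals are computed in closed form rather than sampled, gradients flow through both terms without Reinforce-style estimation, which is the property the abstract emphasizes.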
Related papers
- Active Mining Sample Pair Semantics for Image-text Matching [6.370886833310617]
This paper proposes a novel image-text matching model, called the Active Mining Sample Pair Semantics image-text matching model (AMSPS).
In contrast to the single semantic learning mode of commonsense learning models with a triplet loss function, AMSPS follows an active learning idea.
arXiv Detail & Related papers (2023-11-09T15:03:57Z)
- Revisiting the Role of Language Priors in Vision-Language Models [90.0317841097143]
Vision-language models (VLMs) are applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning.
We study generative VLMs that are trained for next-word generation given an image.
We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks.
arXiv Detail & Related papers (2023-06-02T19:19:43Z)
- Contextual Distortion Reveals Constituency: Masked Language Models are Implicit Parsers [7.558415495951758]
We propose a novel method for extracting parse trees from masked language models (LMs).
Our method computes a score for each span based on the distortion of contextual representations resulting from linguistic perturbations.
Our method consistently outperforms previous state-of-the-art methods on English with masked LMs, and also demonstrates superior performance in a multilingual setting.
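As a rough sketch of that span-scoring idea (not the paper's released code), the snippet below masks a candidate span and measures how much the contextual representations of the surrounding words move. The choice of masking as the perturbation, Euclidean distance as the distortion measure, and the restriction to words outside the span are simplifying assumptions made for brevity.

```python
# Illustrative sketch: score a span by the distortion that perturbing it causes
# in the contextual representations of the other words. Assumes each word maps
# to a single wordpiece (true for the toy sentence below).
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
lm = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def encode(words):
    enc = tok(" ".join(words), return_tensors="pt")
    return lm(**enc).last_hidden_state[0, 1:-1]   # drop [CLS] and [SEP]

def span_score(words, i, j):
    """Distortion outside span [i, j) when the span is replaced by [MASK] tokens."""
    base = encode(words)
    perturbed = words[:i] + [tok.mask_token] * (j - i) + words[j:]
    pert = encode(perturbed)
    outside = list(range(0, i)) + list(range(j, len(words)))
    if not outside:
        return 0.0
    return torch.dist(base[outside], pert[outside]).item()

words = "the cat sat on the mat".split()
print(span_score(words, 3, 6))   # score for the span "on the mat"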
arXiv Detail & Related papers (2023-06-01T13:10:48Z)
- Simple Token-Level Confidence Improves Caption Correctness [117.33497608933169]
Token-Level Confidence, or TLC, is a simple yet surprisingly effective method to assess caption correctness.
We fine-tune a vision-language model on image captioning, input an image and proposed caption to the model, and aggregate token confidences over words or sequences to estimate image-caption consistency.
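The aggregation step can be pictured with a short sketch; the grouping of sub-word tokens into words, the product rule within a word, and the minimum over words are illustrative assumptions rather than the paper's exact recipe, and the confidences below are faked rather than taken from a real captioning model.

```python
# Minimal sketch of aggregating per-token confidences into a caption-level
# consistency estimate (toy numbers, not output from a real VLM).
from math import prod

def word_confidences(token_confs, word_spans):
    """Combine sub-word token confidences into per-word scores (product rule)."""
    return [prod(token_confs[a:b]) for a, b in word_spans]

def caption_consistency(token_confs, word_spans):
    """A single low-confidence word drags the caption down, so take the minimum."""
    return min(word_confidences(token_confs, word_spans))

# Toy example: "a dog on a surfboard" where "surfboard" was split into two tokens.
token_confs = [0.98, 0.95, 0.99, 0.97, 0.60, 0.85]
word_spans  = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 6)]
print(caption_consistency(token_confs, word_spans))  # ~0.51, flags "surfboard"
```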
arXiv Detail & Related papers (2023-05-11T17:58:17Z)
- Consensus Graph Representation Learning for Better Grounded Image Captioning [48.208119537050166]
We propose the Consensus Graph Representation Learning framework (CGRL) for grounded image captioning.
We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset.
arXiv Detail & Related papers (2021-12-02T04:17:01Z)
- Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos [96.85840365678649]
We tackle the problem of referring expression comprehension in videos with an elegant one-stage framework.
We enhance the single-frame grounding accuracy by semantic attention learning and improve the cross-frame grounding consistency.
Our model is also applicable to referring expression comprehension in images, illustrated by the improved performance on the RefCOCO dataset.
arXiv Detail & Related papers (2021-03-23T06:42:49Z)
- Contrastive Learning for Weakly Supervised Phrase Grounding [99.73968052506206]
We show that phrase grounding can be learned by optimizing word-region attention.
A key idea is to construct effective negative captions for learning through language model guided word substitutions.
Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of 5.7% to achieve 76.7% accuracy on the Flickr30K benchmark.
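A minimal sketch of that word-region attention objective is shown below, with random tensors standing in for real text and region encoders and a two-way softmax standing in for the full batch of language-model-generated negative captions; none of the tensor names come from the paper.

```python
# Minimal sketch (not the paper's code): score an image-caption pair via
# word-to-region attention and train contrastively against a negative caption
# whose words were substituted (in the paper, guided by a language model).
import torch
import torch.nn.functional as F

def attention_score(word_emb, region_emb):
    """Each word attends over regions; average the attention-weighted similarities."""
    sims = word_emb @ region_emb.t()                 # (num_words, num_regions)
    attn = sims.softmax(dim=-1)
    return (attn * sims).sum(dim=-1).mean()          # scalar compatibility score

def contrastive_loss(pos_word_emb, neg_word_emb, region_emb):
    """Prefer the true caption over the word-substituted negative caption."""
    scores = torch.stack([attention_score(pos_word_emb, region_emb),
                          attention_score(neg_word_emb, region_emb)])
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([0]))

# Toy usage with random features standing in for real text and region encoders.
num_words, num_regions, dim = 5, 10, 64
loss = contrastive_loss(torch.randn(num_words, dim, requires_grad=True),
                        torch.randn(num_words, dim, requires_grad=True),
                        torch.randn(num_regions, dim, requires_grad=True))
loss.backward()
```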
arXiv Detail & Related papers (2020-06-17T15:00:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.