Distributed Attention for Grounded Image Captioning
- URL: http://arxiv.org/abs/2108.01056v1
- Date: Mon, 2 Aug 2021 17:28:33 GMT
- Title: Distributed Attention for Grounded Image Captioning
- Authors: Nenglun Chen, Xingjia Pan, Runnan Chen, Lei Yang, Zhiwen Lin, Yuqiang
Ren, Haolei Yuan, Xiaowei Guo, Feiyue Huang, Wenping Wang
- Abstract summary: We study the problem of weakly supervised grounded image captioning.
The goal is to automatically generate a sentence describing the context of the image with each noun word grounded to the corresponding region in the image.
- Score: 55.752968732796354
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of weakly supervised grounded image captioning. That is,
given an image, the goal is to automatically generate a sentence describing the
context of the image with each noun word grounded to the corresponding region
in the image. This task is challenging due to the lack of explicit fine-grained
region-word alignments as supervision. Previous weakly supervised methods
mainly explore various regularization schemes to improve attention accuracy,
but their performance still falls far short of fully supervised methods. One
main issue that has been overlooked is that the attention for generating
visually groundable words may focus only on the most discriminative parts of
an object and fail to cover it entirely. To this end, we propose a simple yet
effective method to alleviate this issue, termed the partial grounding problem
in this paper. Specifically, we design a distributed attention mechanism that
forces the network to aggregate information from multiple spatially different
regions with consistent semantics while generating each word. The union of the
attended region proposals should therefore form a visual region that completely
encloses the object of interest. Extensive experiments demonstrate the
superiority of our method over the state of the art.
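The abstract does not spell out the mechanism in detail, but as a rough sketch of the idea, the PyTorch-style snippet below runs several attention branches over region-proposal features, averages their attended contexts for word generation, and adds a simple overlap penalty so the branches spread over spatially different proposals. The branch count, aggregation, and penalty are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch of a distributed attention layer (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistributedAttention(nn.Module):
    def __init__(self, region_dim: int, hidden_dim: int, num_branches: int = 3):
        super().__init__()
        # One query projection per attention branch (branch count is an assumption).
        self.queries = nn.ModuleList(
            [nn.Linear(hidden_dim, region_dim) for _ in range(num_branches)]
        )

    def forward(self, regions: torch.Tensor, state: torch.Tensor):
        """regions: (B, N, D) proposal features; state: (B, H) decoder state."""
        contexts, attentions = [], []
        for proj in self.queries:
            # Each branch scores all proposals against the current decoder state.
            scores = torch.bmm(regions, proj(state).unsqueeze(-1)).squeeze(-1)  # (B, N)
            attn = F.softmax(scores, dim=-1)
            contexts.append(torch.bmm(attn.unsqueeze(1), regions).squeeze(1))   # (B, D)
            attentions.append(attn)
        # Aggregate the branch contexts; the union of attended proposals should
        # cover the whole object rather than only its most discriminative part.
        context = torch.stack(contexts, dim=1).mean(dim=1)
        # Illustrative diversity penalty: discourage branches from collapsing
        # onto the same proposals by penalizing pairwise attention overlap.
        attn_mat = torch.stack(attentions, dim=1)                  # (B, K, N)
        overlap = torch.bmm(attn_mat, attn_mat.transpose(1, 2))    # (B, K, K)
        eye = torch.eye(attn_mat.size(1), device=regions.device)
        diversity_loss = (overlap * (1 - eye)).mean()
        return context, diversity_loss
```

In such a setup the diversity term would typically be weighted and added to the usual captioning loss; the weighting is left to the caller here.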
Related papers
- Top-Down Framework for Weakly-supervised Grounded Image Captioning [19.00510117145054]
Weakly-supervised grounded image captioning aims to generate the caption and ground (localize) predicted object words in the input image without using bounding box supervision.
We propose a one-stage weakly-supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level.
arXiv Detail & Related papers (2023-06-13T01:42:18Z)
- Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search [61.24539128142504]
Text-based person search (TBPS) is a challenging task that aims to search pedestrian images with the same identity from an image gallery given a query text.
Most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities.
We propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels.
arXiv Detail & Related papers (2022-08-30T16:14:18Z)
- AttnGrounder: Talking to Cars with Attention [6.09170287691728]
We propose a single-stage end-to-end trainable model for the task of visual grounding.
Visual grounding aims to localize a specific object in an image based on a given natural language text query.
We evaluate AttnGrounder on the Talk2Car dataset and show an improvement of 3.26% over the existing methods.
arXiv Detail & Related papers (2020-09-11T23:18:55Z)
- Fine-Grained Image Captioning with Global-Local Discriminative Objective [80.73827423555655]
We propose a novel global-local discriminative objective to facilitate generating fine-grained descriptive captions.
We evaluate the proposed method on the widely used MS-COCO dataset.
arXiv Detail & Related papers (2020-07-21T08:46:02Z)
- Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation [55.198596946371126]
We propose a contrastive learning framework that accounts for both region-phrase and image-sentence matching.
Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed.
The design of such score functions removes the need for object detection at test time, significantly reducing the inference cost.
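A minimal sketch of the general recipe described here, assuming cosine similarities and simple max/mean pooling (the paper's actual score functions may differ): the image-sentence score is assembled purely from region-phrase scores, so no separate detection step is needed at test time.

```python
# Illustrative region-phrase to image-sentence scoring (not the paper's exact design).
import torch
import torch.nn.functional as F

def region_phrase_scores(region_feats: torch.Tensor, phrase_feats: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between every region and every phrase.

    region_feats: (N, D) features of N region proposals in one image.
    phrase_feats: (P, D) features of P phrases in one sentence.
    Returns an (N, P) score matrix.
    """
    r = F.normalize(region_feats, dim=-1)
    p = F.normalize(phrase_feats, dim=-1)
    return r @ p.t()

def image_sentence_score(region_feats: torch.Tensor, phrase_feats: torch.Tensor) -> torch.Tensor:
    """Each phrase takes its best-matching region; phrase scores are averaged."""
    scores = region_phrase_scores(region_feats, phrase_feats)  # (N, P)
    per_phrase = scores.max(dim=0).values                      # (P,)
    return per_phrase.mean()
```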
arXiv Detail & Related papers (2020-07-03T22:02:00Z)
- Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets a new state of the art in all these settings, demonstrating its efficacy and generalizability.
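As a rough illustration of a single co-attention step between two related images (the paper incorporates two co-attentions into a classifier; the version below is a generic simplification with assumed dimensions): an affinity matrix between the images' flattened features lets each image pull in context from the other.

```python
# Generic cross-image co-attention step (illustrative simplification).
import torch
import torch.nn.functional as F

def co_attention(feat_a: torch.Tensor, feat_b: torch.Tensor):
    """feat_a, feat_b: (HW, D) flattened feature maps of two related images."""
    affinity = feat_a @ feat_b.t()                        # (HW_a, HW_b) pixel-pair affinities
    a_from_b = F.softmax(affinity, dim=1) @ feat_b        # context for image A from image B
    b_from_a = F.softmax(affinity, dim=0).t() @ feat_a    # context for image B from image A
    return a_from_b, b_from_a
```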
arXiv Detail & Related papers (2020-07-03T21:53:46Z)
- Contrastive Learning for Weakly Supervised Phrase Grounding [99.73968052506206]
We show that phrase grounding can be learned by optimizing word-region attention.
A key idea is to construct effective negative captions for learning through language model guided word substitutions.
Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of 5.7% to achieve 76.7% accuracy on the Flickr30K benchmark.
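A minimal sketch of this training signal, assuming simple dot-product word-region attention and a cross-entropy ranking loss; the paper's exact objective and the language-model-guided construction of the negative captions are not reproduced here.

```python
# Illustrative contrastive objective over word-region attention (assumed formulation).
import torch
import torch.nn.functional as F

def caption_score(word_feats: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
    """word_feats: (T, D); region_feats: (N, D). Attention-pooled compatibility score."""
    attn = F.softmax(word_feats @ region_feats.t(), dim=-1)   # (T, N) word-region attention
    attended = attn @ region_feats                            # (T, D) attended region features
    return (word_feats * attended).sum(dim=-1).mean()         # scalar caption-image score

def contrastive_loss(pos_words, region_feats, neg_word_sets):
    """Rank the true caption above negatives obtained by word substitutions."""
    scores = torch.stack([caption_score(pos_words, region_feats)] +
                         [caption_score(neg, region_feats) for neg in neg_word_sets])
    return F.cross_entropy(scores.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```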
arXiv Detail & Related papers (2020-06-17T15:00:53Z)
- MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level [6.47137925955334]
We propose to utilize spatial attention networks for image-level visual-textual fusion.
We refine region proposals with an in-network Region Proposal Network (RPN) and detect single or multiple regions for a phrase query.
On the ReferIt referring expression dataset, our Multi-region Attention-assisted Grounding network (MAGNet) achieves over a 12% improvement over the state of the art.
arXiv Detail & Related papers (2020-06-06T04:14:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.