Top-Down Framework for Weakly-supervised Grounded Image Captioning
- URL: http://arxiv.org/abs/2306.07490v3
- Date: Sat, 2 Mar 2024 15:10:16 GMT
- Title: Top-Down Framework for Weakly-supervised Grounded Image Captioning
- Authors: Chen Cai, Suchen Wang, Kim-Hui Yap, Yi Wang
- Abstract summary: Weakly-supervised grounded image captioning aims to generate the caption and ground (localize) predicted object words in the input image without using bounding box supervision.
We propose a one-stage weakly-supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level.
- Score: 19.00510117145054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised grounded image captioning (WSGIC) aims to generate the
caption and ground (localize) predicted object words in the input image without
using bounding box supervision. Recent two-stage solutions mostly apply a
bottom-up pipeline: (1) encode the input image into multiple region features
using an object detector; (2) leverage region features for captioning and
grounding. However, relying on independent proposals produced by object
detectors tends to make the subsequent grounded captioner overfit to finding
the correct object words, overlook the relations between objects, and
select incompatible proposal regions for grounding. To address these issues,
we propose a one-stage weakly-supervised grounded captioner that directly takes
the RGB image as input to perform captioning and grounding at the top-down
image level. Specifically, we encode the image into visual token
representations and propose a Recurrent Grounding Module (RGM) in the decoder
to obtain precise Visual Language Attention Maps (VLAMs), which recognize the
spatial locations of the objects. In addition, we explicitly inject a relation
module into our one-stage framework to encourage relation understanding through
multi-label classification. These relation semantics serve as contextual
information that facilitates the prediction of relation and object words in the
caption. We observe that the relation semantics not only assist the grounded
captioner in generating a more accurate caption but also improve the grounding
performance. We validate the effectiveness of our proposed method on two
challenging datasets (Flickr30k Entities captioning and MSCOCO captioning). The
experimental results demonstrate that our method achieves state-of-the-art
grounding performance.
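As a rough illustration of the top-down idea described in the abstract (not the authors' implementation), the sketch below shows how a single decoder step could turn word-conditioned attention over ViT-style visual tokens into a spatial Visual-Language Attention Map (VLAM) that is upsampled for grounding. The module name, dimensions, grid size, and single-head attention are all assumptions made for the example.

```python
# Minimal sketch, assuming a ViT-style encoder that yields a 12x12 grid of visual tokens.
# Not the paper's code: it only illustrates word-conditioned attention producing a spatial map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLAMGroundingStep(nn.Module):
    def __init__(self, d_model=512, grid=(12, 12)):
        super().__init__()
        self.grid = grid                       # spatial layout of the visual tokens (H/patch, W/patch)
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)

    def forward(self, word_state, visual_tokens):
        # word_state: (B, d_model) decoder hidden state for the current word
        # visual_tokens: (B, N, d_model) with N == grid[0] * grid[1]
        q = self.q_proj(word_state).unsqueeze(1)                                   # (B, 1, d)
        k = self.k_proj(visual_tokens)                                             # (B, N, d)
        att = torch.softmax((q @ k.transpose(1, 2)) / k.size(-1) ** 0.5, dim=-1)   # (B, 1, N)
        vlam = att.view(-1, 1, *self.grid)                 # attention scores reshaped into a spatial map
        context = (att @ visual_tokens).squeeze(1)         # attended visual context used for captioning
        return vlam, context

# Grounding a predicted object word: upsample the VLAM to image resolution and read off
# its peak response (weak localization; no bounding-box supervision is involved).
step = VLAMGroundingStep()
vlam, ctx = step(torch.randn(2, 512), torch.randn(2, 144, 512))
heatmap = F.interpolate(vlam, size=(384, 384), mode="bilinear", align_corners=False)
```

In the weakly-supervised setting of the paper, such a map would be trained only through the captioning objective (plus the multi-label relation classification described above); the upsampled heatmap then serves as the grounding evidence for the predicted object word.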
Related papers
- Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation [27.95875467352853]
We propose a new referring remote sensing image segmentation method, FIANet, that fully exploits the visual and linguistic representations.
The proposed fine-grained image-text alignment module (FIAM) simultaneously leverages the features of the input image and the corresponding texts.
We evaluate the effectiveness of the proposed methods on two public referring remote sensing datasets including RefSegRS and RRSIS-D.
arXiv Detail & Related papers (2024-09-20T16:45:32Z)
- Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target.
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- Exploiting Auxiliary Caption for Video Grounding [66.77519356911051]
Video grounding aims to locate a moment of interest matching a given query sentence from an untrimmed video.
Previous works ignore the sparsity dilemma in video annotations, which fails to provide the context information between potential events and query sentences in the dataset.
We propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions, and then obtain auxiliary captions by Non-Auxiliary Caption Suppression (NACS).
To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and query sentences.
arXiv Detail & Related papers (2023-01-15T02:04:02Z)
- Consensus Graph Representation Learning for Better Grounded Image Captioning [48.208119537050166]
We propose the Consensus Graph Representation Learning framework (CGRL) for grounded image captioning.
We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset.
arXiv Detail & Related papers (2021-12-02T04:17:01Z)
- MOC-GAN: Mixing Objects and Captions to Generate Realistic Images [21.240099965546637]
We introduce a more rational setting, generating a realistic image from the objects and captions.
Under this setting, objects explicitly define the critical roles in the targeted images and captions implicitly describe their rich attributes and connections.
A MOC-GAN is proposed to mix the inputs of two modalities to generate realistic images.
arXiv Detail & Related papers (2021-06-06T14:04:07Z)
- Image Captioning with Visual Object Representations Grounded in the Textual Modality [14.797241131469486]
We explore the possibilities of a shared embedding space between the textual and visual modalities.
We propose an approach opposite to the current trend: grounding the representations in the word embedding space of the captioning system.
arXiv Detail & Related papers (2020-10-19T12:21:38Z)
- Contrastive Learning for Weakly Supervised Phrase Grounding [99.73968052506206]
We show that phrase grounding can be learned by optimizing word-region attention.
A key idea is to construct effective negative captions for learning through language model guided word substitutions.
Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of 5.7% to achieve 76.7% accuracy on the Flickr30K benchmark.
arXiv Detail & Related papers (2020-06-17T15:00:53Z)
- More Grounded Image Captioning by Distilling Image-Text Matching Model [56.79895670335411]
We propose a Part-of-Speech (POS) enhanced image-text matching model (POS-SCAN) as effective knowledge distillation for more grounded image captioning.
The benefits are two-fold: 1) given a sentence and an image, POS-SCAN can ground the objects more accurately than SCAN; 2) POS-SCAN serves as a word-region alignment regularization for the captioner's visual attention module.
arXiv Detail & Related papers (2020-04-01T12:42:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.