More Grounded Image Captioning by Distilling Image-Text Matching Model
- URL: http://arxiv.org/abs/2004.00390v1
- Date: Wed, 1 Apr 2020 12:42:06 GMT
- Title: More Grounded Image Captioning by Distilling Image-Text Matching Model
- Authors: Yuanen Zhou, Meng Wang, Daqing Liu, Zhenzhen Hu, Hanwang Zhang
- Abstract summary: We propose a Part-of-Speech (POS) enhanced image-text matching model, POS-SCAN, as effective knowledge distillation for more grounded image captioning.
The benefits are two-fold: 1) given a sentence and an image, POS-SCAN can ground the objects more accurately than SCAN; 2) POS-SCAN serves as a word-region alignment regularization for the captioner's visual attention module.
- Score: 56.79895670335411
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual attention not only improves the performance of image captioners, but
also serves as a visual interpretation to qualitatively measure the caption
rationality and model transparency. Specifically, we expect that a captioner
can fix its attentive gaze on the correct objects while generating the
corresponding words. This ability is also known as grounded image captioning.
However, the grounding accuracy of existing captioners is far from
satisfactory. Improving the grounding accuracy while retaining the captioning
quality would require word-region alignment annotations, which are expensive to
collect as strong supervision. To this end, we propose a Part-of-Speech (POS)
enhanced image-text matching model, POS-SCAN (built on SCAN
\cite{lee2018stacked}), as effective knowledge distillation for more grounded
image captioning. The benefits are
two-fold: 1) given a sentence and an image, POS-SCAN can ground the objects
more accurately than SCAN; 2) POS-SCAN serves as a word-region alignment
regularization for the captioner's visual attention module. By showing
benchmark experimental results, we demonstrate that conventional image
captioners equipped with POS-SCAN can significantly improve the grounding
accuracy without strong supervision. Last but not least, we explore the
indispensable Self-Critical Sequence Training (SCST) \cite{Rennie_2017_CVPR} in
the context of grounded image captioning and show that the image-text matching
score can serve as a reward for more grounded captioning
\footnote{https://github.com/YuanEZhou/Grounded-Image-Captioning}.
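To make the two ingredients above concrete, below is a minimal PyTorch-style sketch (not the authors' implementation; see the repository linked in the footnote for that) of a POS-masked, SCAN-style word-region attention, the alignment regularizer it enables on the captioner's visual attention, and an SCST loss whose reward mixes CIDEr with an image-text matching score. The tensor shapes, the function names (pos_scan_attention, alignment_regularizer, scst_with_matching_reward), and the mixing weight gamma are illustrative assumptions, not taken from the paper or its code.

```python
# Illustrative sketch only; shapes, names, and weights are assumptions, not the paper's code.
import torch
import torch.nn.functional as F


def pos_scan_attention(word_emb, region_feat, noun_mask, temperature=9.0):
    """SCAN-style word-to-region attention, kept only for POS-selected (noun) words.

    word_emb:    (B, T, D) caption word embeddings
    region_feat: (B, R, D) detected image-region features
    noun_mask:   (B, T)    1.0 where the word is a noun, else 0.0
    returns:     (B, T, R) attention over regions, zeroed for non-noun words
    """
    sim = torch.einsum(
        "btd,brd->btr",
        F.normalize(word_emb, dim=-1),
        F.normalize(region_feat, dim=-1),
    )
    attn = F.softmax(temperature * sim, dim=-1)
    return attn * noun_mask.unsqueeze(-1)


def alignment_regularizer(captioner_attn, teacher_attn, noun_mask, eps=1e-8):
    """KL-style penalty pulling the captioner's visual attention toward the
    POS-SCAN word-region alignment, evaluated only at noun time steps."""
    p = teacher_attn + eps        # distillation target, (B, T, R)
    q = captioner_attn + eps      # captioner's attention, (B, T, R)
    kl = (p * (p.log() - q.log())).sum(dim=-1)              # (B, T)
    return (kl * noun_mask).sum() / noun_mask.sum().clamp(min=1.0)


def scst_with_matching_reward(sample_logprobs, cider, match_score,
                              cider_baseline, match_baseline, gamma=0.1):
    """Self-critical policy-gradient loss where the sampled caption's image-text
    matching score is added to the CIDEr reward (gamma is a made-up weight)."""
    reward = (cider - cider_baseline) + gamma * (match_score - match_baseline)  # (B,)
    return -(reward.detach().unsqueeze(-1) * sample_logprobs).mean()
```

Under these assumptions, the cross-entropy stage would add a weighted alignment_regularizer term to the captioning loss, and the fine-tuning stage would switch to scst_with_matching_reward; the actual weighting and training schedule are left to the paper and its released code.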
Related papers
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via
Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z)
- Top-Down Framework for Weakly-supervised Grounded Image Captioning [19.00510117145054]
Weakly-supervised grounded image captioning aims to generate the caption and ground (localize) predicted object words in the input image without using bounding box supervision.
We propose a one-stage weakly-supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level.
arXiv Detail & Related papers (2023-06-13T01:42:18Z)
- Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which heuristically optimizes the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z)
- CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in terms of both captioning accuracy and diversity.
arXiv Detail & Related papers (2022-04-27T14:40:31Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our method maintains robust performance and gives more flexible scores to candidate captions when faced with semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Contrastive Learning for Weakly Supervised Phrase Grounding [99.73968052506206]
We show that phrase grounding can be learned by optimizing word-region attention.
A key idea is to construct effective negative captions for learning through language model guided word substitutions.
Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of 5.7% to achieve 76.7% accuracy on the Flickr30K benchmark.
arXiv Detail & Related papers (2020-06-17T15:00:53Z)
- Pragmatic Issue-Sensitive Image Captioning [11.998287522410404]
We propose Issue-Sensitive Image Captioning (ISIC).
In ISIC, the captioner is given a target image and an issue: a set of images partitioned in a way that specifies what information is relevant.
We show how ISIC can complement and enrich the related task of Visual Question Answering.
arXiv Detail & Related papers (2020-04-29T20:00:53Z)
- Egoshots, an ego-vision life-logging dataset and semantic fidelity metric to evaluate diversity in image captioning models [63.11766263832545]
We present a new image captioning dataset, Egoshots, consisting of 978 real-life images with no captions.
To evaluate the quality of the generated captions, we propose a new image captioning metric, object-based Semantic Fidelity (SF).
arXiv Detail & Related papers (2020-03-26T04:43:30Z)