KENGIC: KEyword-driven and N-Gram Graph based Image Captioning
- URL: http://arxiv.org/abs/2302.03729v1
- Date: Tue, 7 Feb 2023 19:48:55 GMT
- Title: KENGIC: KEyword-driven and N-Gram Graph based Image Captioning
- Authors: Brandon Birmingham and Adrian Muscat
- Abstract summary: Keyword-driven and N-Gram Graph based approach for Image Captioning (KENGIC)
The model forms a directed graph by connecting keyword nodes through overlapping n-grams found in a given text corpus.
Analysis of this approach could also shed light on the generation process behind current top performing caption generators trained in the paired setting.
- Score: 0.988326119238361
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a Keyword-driven and N-gram Graph based approach for
Image Captioning (KENGIC). Most current state-of-the-art image caption
generators are trained end-to-end on large scale paired image-caption datasets
which are very laborious and expensive to collect. Such models are limited in
terms of their explainability and their applicability across different domains.
To address these limitations, a simple model based on N-Gram graphs which does
not require any end-to-end training on paired image captions is proposed.
Starting with a set of image keywords considered as nodes, the generator is
designed to form a directed graph by connecting these nodes through overlapping
n-grams as found in a given text corpus. The model then infers the caption by
maximising the most probable n-gram sequences from the constructed graph. To
analyse the use and choice of keywords in the context of this approach, this study
analysed the generation of image captions based on (a) keywords extracted from
gold standard captions and (b) automatically detected keywords. Both
quantitative and qualitative analyses demonstrated the effectiveness of KENGIC.
The performance achieved is very close to that of current state-of-the-art
image caption generators that are trained in the unpaired setting. The analysis
of this approach could also shed light on the generation process behind current
top performing caption generators trained in the paired setting, and in
addition, provide insights on the limitations of the current most widely used
evaluation metrics in automatic image captioning.
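To make the mechanism sketched in the abstract concrete, the snippet below is a minimal illustration rather than the authors' implementation: it assumes whitespace tokenisation, trigrams, and a toy greedy decoder that grows a caption from a single keyword-bearing seed, whereas KENGIC maximises the most probable n-gram sequences over paths connecting the keyword nodes. The function names (build_graph, greedy_caption) are illustrative only.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_graph(corpus_sentences, n=3):
    """Count corpus n-grams and link each n-gram to the n-grams that
    follow it, i.e. that overlap it in n-1 tokens."""
    counts = Counter()
    edges = defaultdict(Counter)
    for sentence in corpus_sentences:
        grams = ngrams(sentence.lower().split(), n)
        counts.update(grams)
        for a, b in zip(grams, grams[1:]):
            edges[a][b] += 1
    return counts, edges

def greedy_caption(keywords, counts, edges, max_len=12):
    """Toy decoder: seed with the most frequent n-gram containing a keyword,
    then repeatedly follow the most frequent outgoing edge. (KENGIC itself
    scores whole paths that connect the keyword nodes.)"""
    seeds = [g for g in counts if any(k in g for k in keywords)]
    if not seeds:
        return ""
    node = max(seeds, key=counts.__getitem__)
    caption = list(node)
    while len(caption) < max_len and edges[node]:
        node = edges[node].most_common(1)[0][0]
        caption.append(node[-1])  # each overlapping step adds one new token
    return " ".join(caption)

corpus = [
    "a dog is running on the beach",
    "a dog is playing with a ball on the grass",
    "a man is running on the beach at sunset",
]
counts, edges = build_graph(corpus, n=3)
print(greedy_caption({"dog", "beach"}, counts, edges))
# -> "a dog is running on the beach at sunset" (with this toy corpus)
```

A faithful decoder would additionally join several keyword nodes within one path and rank complete paths by their n-gram probabilities, rather than extending a single seed greedily.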
Related papers
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z) - FuseCap: Leveraging Large Language Models for Enriched Fused Image
Captions [11.274127953112574]
We propose an automated approach to augmenting existing captions with visual details using "frozen" vision experts.
Our proposed method, FuseCap, fuses the outputs of such vision experts with the original captions using a large language model.
We release this large-scale dataset of enriched image-caption pairs for the community.
arXiv Detail & Related papers (2023-05-28T13:16:03Z) - Towards Few-shot Entity Recognition in Document Images: A Graph Neural
Network Approach Robust to Image Manipulation [38.09501948846373]
We introduce the topological adjacency relationship among the tokens, emphasizing their relative position information.
We incorporate these graphs into the pre-trained language model by adding graph neural network layers on top of the language model embeddings (see the generic sketch after this list).
Experiments on two benchmark datasets show that LAGER significantly outperforms strong baselines under different few-shot settings.
arXiv Detail & Related papers (2023-05-24T07:34:33Z) - Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of their powerful generalization ability.
arXiv Detail & Related papers (2022-05-26T03:13:43Z) - Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed as ViTCAP, in which grid representations are used without extracting the regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z) - Partially-supervised novel object captioning leveraging context from
paired data [11.215352918313577]
We create synthetic paired captioning data for novel objects by leveraging context from existing image-caption pairs.
We further re-use these partially paired images with novel objects to create pseudo-label captions.
Our approach achieves state-of-the-art results on held-out MS COCO out-of-domain test split.
arXiv Detail & Related papers (2021-09-10T21:31:42Z) - SG2Caps: Revisiting Scene Graphs for Image Captioning [37.58310822924814]
We propose a framework, SG2Caps, that utilizes only the scene graph labels for competitive image captioning performance.
Our framework outperforms existing scene graph-only captioning models by a large margin (CIDEr score of 110 vs 71) indicating scene graphs as a promising representation for image captioning.
arXiv Detail & Related papers (2021-02-09T18:00:53Z) - Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed metric maintains robust performance and gives more flexible scores to candidate captions when faced with semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z) - GraphPB: Graphical Representations of Prosody Boundary in Speech
Synthesis [23.836992815219904]
This paper introduces a graphical representation approach of prosody boundary (GraphPB) in the task of Chinese speech synthesis.
The nodes of the graph embedding are formed by prosodic words, and the edges are formed by the other prosodic boundaries.
Two techniques are proposed to embed sequential information into the graph-to-sequence text-to-speech model.
arXiv Detail & Related papers (2020-12-03T03:34:05Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z) - XGPT: Cross-modal Generative Pre-Training for Image Captioning [80.26456233277435]
XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning.
It is designed to pre-train text-to-image caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)
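The few-shot entity recognition entry above describes adding graph neural network layers on top of pre-trained language model embeddings. The snippet below is a minimal, generic sketch of that pattern only; the layer sizes, the averaging message-passing update with a residual connection, and the random adjacency in the toy example are assumptions for illustration, not details taken from the LAGER paper.

```python
import torch
import torch.nn as nn

class GraphOverLMEmbeddings(nn.Module):
    """Generic graph layers applied to token embeddings from a pre-trained LM."""
    def __init__(self, hidden_dim=768, num_layers=2):
        super().__init__()
        self.gnn_layers = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers)
        )

    def forward(self, token_embeddings, adjacency):
        # token_embeddings: (batch, seq, dim) produced by a pre-trained LM.
        # adjacency: (batch, seq, seq) binary matrix encoding, e.g., the
        # spatial adjacency of tokens on the document image.
        degree = adjacency.sum(-1, keepdim=True).clamp(min=1)
        norm_adj = adjacency / degree          # each node averages its neighbours
        h = token_embeddings
        for layer in self.gnn_layers:
            h = torch.relu(layer(norm_adj @ h)) + h   # message passing + residual
        return h

# Toy usage with random tensors standing in for real LM outputs.
emb = torch.randn(2, 16, 768)
adj = (torch.rand(2, 16, 16) > 0.8).float()
out = GraphOverLMEmbeddings()(emb, adj)
print(out.shape)  # torch.Size([2, 16, 768])
```

Row-normalising the adjacency keeps each node's update an average over its neighbours, a common and simple choice for token graphs of this kind.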
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.