A Thorough Review on Recent Deep Learning Methodologies for Image
Captioning
- URL: http://arxiv.org/abs/2107.13114v1
- Date: Wed, 28 Jul 2021 00:54:59 GMT
- Title: A Thorough Review on Recent Deep Learning Methodologies for Image
Captioning
- Authors: Ahmed Elhagry, Karima Kadaoui
- Abstract summary: It is becoming increasingly difficult to keep up with the latest research and findings in the field of image captioning.
This review paper serves as a roadmap for researchers to keep up to date with the latest contributions made in the field of image caption generation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image Captioning is a task that combines computer vision and natural language
processing, aiming to generate descriptive captions for images. It is a two-fold
process relying on accurate image understanding and correct language understanding,
both syntactically and semantically. It is becoming increasingly difficult to keep up
with the latest research and findings in the field of image captioning due to the
growing amount of knowledge available on the topic, and the available review papers
do not cover those findings sufficiently. In this paper, we run through the current
techniques, datasets, benchmarks, and evaluation metrics used in image captioning.
Current research in the field focuses mostly on deep learning-based methods, with
attention mechanisms, deep reinforcement learning, and adversarial learning at the
forefront. We review recent methodologies such as UpDown, OSCAR, VIVO, Meta Learning,
and a model that uses conditional generative adversarial nets. Although the GAN-based
model achieves the highest score, UpDown remains an important basis for image
captioning, while OSCAR and VIVO are more useful because they address novel object
captioning. This review paper serves as a roadmap for researchers to keep up to date
with the latest contributions made in the field of image caption generation.
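Most of the reviewed models share one mechanism: at every decoding step, an attention module weights pre-extracted image region features and feeds the attended context into a language decoder. The PyTorch sketch below illustrates that loop in the spirit of top-down attention captioners such as UpDown; the feature dimensions, the single-layer LSTM, and the additive attention are simplifying assumptions for readability, not any of the published architectures.

```python
# Minimal sketch of attention over pre-extracted region features, in the spirit of
# top-down attention captioners such as UpDown. Dimensions, names, and the single
# LSTM layer are illustrative assumptions, not the published architecture.
import torch
import torch.nn as nn


class AttentionCaptioner(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # The LSTM consumes the previous word plus the attended image feature.
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        # Additive attention: score each region against the current hidden state.
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, regions, tokens):
        # regions: (batch, num_regions, feat_dim) pooled detector features
        # tokens:  (batch, seq_len) ground-truth caption tokens (teacher forcing)
        b = regions.size(0)
        h = regions.new_zeros(b, self.lstm.hidden_size)
        c = regions.new_zeros(b, self.lstm.hidden_size)
        logits = []
        for t in range(tokens.size(1)):
            # Attention weights over regions given the current hidden state.
            scores = self.att_score(
                torch.tanh(self.att_feat(regions) + self.att_hid(h).unsqueeze(1))
            )
            alpha = torch.softmax(scores, dim=1)       # (b, num_regions, 1)
            context = (alpha * regions).sum(dim=1)     # (b, feat_dim)
            h, c = self.lstm(torch.cat([self.embed(tokens[:, t]), context], dim=-1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)              # (b, seq_len, vocab_size)


# Example with random features standing in for detector output.
model = AttentionCaptioner(vocab_size=10000)
feats = torch.randn(2, 36, 2048)
caps = torch.randint(0, 10000, (2, 12))
print(model(feats, caps).shape)  # torch.Size([2, 12, 10000])
```

Training this sketch with cross-entropy corresponds to the standard teacher-forced setup; the reinforcement- and adversarial-learning variants discussed in the paper replace or augment that objective rather than the attention loop itself.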
Related papers
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image (a retrieval sketch using this mechanism appears after this list).
(arXiv, 2023-08-16)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
(arXiv, 2023-07-21)
- Recent Advances in Scene Image Representation and Classification [1.8369974607582584]
We review the existing scene image representation methods that are being used widely for image classification.
We compare their performance both qualitatively (e.g., quality of outputs, pros and cons) and quantitatively (e.g., accuracy).
Overall, this survey provides in-depth insights and applications of recent scene image representation methods for traditional Computer Vision (CV)-based methods, Deep Learning (DL)-based methods, and Search Engine (SE)-based methods.
(arXiv, 2022-06-15)
- Bench-Marking And Improving Arabic Automatic Image Captioning Through The Use Of Multi-Task Learning Paradigm [0.0]
This paper explores methods and techniques that could enhance the performance of Arabic image captioning.
The use of multi-task learning and pre-trained word embeddings noticeably enhanced the quality of image captioning.
However, the presented results show that Arabic captioning still lags behind English.
(arXiv, 2022-02-11)
- Deep Learning Approaches on Image Captioning: A Review [0.5852077003870417]
Image captioning aims to generate natural language descriptions for visual content in the form of still images.
Deep learning and vision-language pre-training techniques have revolutionized the field, leading to more sophisticated methods and improved performance.
We address the challenges faced in this field by emphasizing issues such as object hallucination, missing context, illumination conditions, contextual understanding, and referring expressions.
We identify several potential future directions for research in this area, which include tackling the information misalignment problem between image and text modalities, mitigating dataset bias, incorporating vision-language pre-training methods to enhance caption generation, and developing improved evaluation tools to accurately assess the quality of generated captions.
(arXiv, 2022-01-31)
- From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not reached a conclusive answer yet.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
(arXiv, 2021-07-14)
- Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval [0.6091702876917279]
We focus on zero-shot image retrieval using sentences as queries and present a survey of the technological trends in this area.
We provide a comprehensive overview of the history of the technology, starting with a discussion of the early studies of image-to-text matching.
A description of the datasets commonly used in experiments and a comparison of the evaluation results of each method are presented.
(arXiv, 2021-05-16)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE); an embedding-based sketch in this spirit appears after this list.
Experiment results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less-aligned semantics.
(arXiv, 2020-12-14)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
(arXiv, 2020-11-19)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
(arXiv, 2020-06-21)
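Two entries above rely on the same retrieval mechanism: the news-captioning paper retrieves sentences that are semantically close to the image with CLIP, and the zero-shot retrieval survey covers sentence-as-query image retrieval with visual-semantic embeddings. The sketch below shows that mechanism with the Hugging Face CLIP interface; the checkpoint name, image path, and candidate sentences are illustrative assumptions, not the papers' actual pipelines.

```python
# Hedged sketch of CLIP-based sentence-image matching: embed an image and a pool
# of candidate sentences in the same space and rank them by similarity.
# Checkpoint, image path, and candidates are placeholders, not the cited pipelines.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
candidates = [
    "A politician speaks at a press conference.",
    "A dog runs across a beach.",
    "Players celebrate after scoring a goal.",
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each candidate sentence.
scores = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for sent, s in sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1]):
    print(f"{s:.3f}  {sent}")
```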
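The Intrinsic Image Captioning Evaluation entry motivates learned metrics that score meaning rather than n-gram overlap. The sketch below is a generic embedding-based scorer in that spirit, not the I2CE implementation; the sentence-embedding model and the example captions are assumptions.

```python
# Hedged sketch of a learned, embedding-based caption score (NOT the I2CE
# implementation): the candidate is scored by cosine similarity to reference
# captions in a sentence-embedding space rather than by n-gram overlap.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

references = [
    "A man is riding a horse along the beach.",
    "Someone rides a horse next to the ocean.",
]
candidate = "A person on horseback travels by the sea."  # little n-gram overlap, same meaning

ref_emb = model.encode(references, convert_to_tensor=True)
cand_emb = model.encode(candidate, convert_to_tensor=True)

# Score = best cosine similarity against any reference caption.
score = util.cos_sim(cand_emb, ref_emb).max().item()
print(f"embedding-based score: {score:.3f}")
```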