Attention Beam: An Image Captioning Approach
- URL: http://arxiv.org/abs/2011.01753v2
- Date: Wed, 11 Nov 2020 15:17:56 GMT
- Title: Attention Beam: An Image Captioning Approach
- Authors: Anubhav Shrimal, Tanmoy Chakraborty
- Abstract summary: In recent times, encoder-decoder based architectures have achieved state-of-the-art results for image captioning.
Here, we present a beam search heuristic on top of the encoder-decoder based architecture that yields better-quality captions on three benchmark datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The aim of image captioning is to generate a textual description of a given image. Though seemingly an easy task for humans, it is challenging for machines, as it requires the ability to comprehend the image (computer vision) and consequently generate a human-like description for it (natural language understanding). In recent times, encoder-decoder based architectures have achieved state-of-the-art results for image captioning. Here, we present a heuristic of beam search on top of the encoder-decoder based architecture that gives better-quality captions on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
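As context for the heuristic above, plain beam search over a caption decoder can be sketched as follows. This is a minimal, hypothetical sketch, not the authors' implementation: the `decoder_step` interface, the special tokens, and the length normalization are all assumptions.

    import math

    def beam_search(decoder_step, bos, eos, beam_width=3, max_len=20):
        """Keep the `beam_width` highest-scoring partial captions at each step.

        `decoder_step(seq)` is a stand-in for one decoder step conditioned on
        the encoder's image features: it maps a token prefix to a dict of
        {next_token: log_probability}.
        """
        beams = [([bos], 0.0)]                # (token sequence, cumulative log-prob)
        completed = []
        for _ in range(max_len):
            candidates = []
            for seq, score in beams:
                if seq[-1] == eos:            # finished captions leave the beam
                    completed.append((seq, score))
                else:
                    for tok, logp in decoder_step(seq).items():
                        candidates.append((seq + [tok], score + logp))
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
            if not beams:
                break
        completed.extend(beams)               # captions still open at max_len
        return max(completed, key=lambda c: c[1] / len(c[0]))  # length-normalized best

    # Toy usage with a fake bigram "decoder":
    table = {"<s>": {"a": math.log(0.6), "the": math.log(0.4)},
             "a": {"cat": math.log(0.7), "</s>": math.log(0.3)},
             "the": {"dog": math.log(0.9), "</s>": math.log(0.1)},
             "cat": {"</s>": 0.0}, "dog": {"</s>": 0.0}}
    caption, _ = beam_search(lambda seq: table[seq[-1]], "<s>", "</s>")
    print(" ".join(caption))                  # <s> a cat </s>

The paper's contribution is a heuristic layered on top of this procedure; the plain version above is only the baseline it refines.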
Related papers
- A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning
We propose a novel encoder-decoder setup that deploys a Text Graph Convolutional Network (TextGCN) and multi-layer LSTMs.
The embeddings generated by TextGCN enhance the decoder's understanding by capturing the semantic relationships among words at both the sentence and corpus levels.
We present an extensive evaluation of our approach against various other state-of-the-art encoder-decoder frameworks.
arXiv Detail & Related papers (2024-09-27T06:12:31Z)
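As a hedged sketch of the idea in the entry above, graph-refined word embeddings can feed an ordinary LSTM decoder. The adjacency construction, the single GCN layer, and the residual combination are assumptions of this sketch, not the paper's exact design.

    import torch
    import torch.nn as nn

    class GCNRefinedEmbedding(nn.Module):
        """Word embeddings refined by one graph-convolution pass over a word graph.

        `adj` is a row-normalized |V| x |V| adjacency built offline, e.g. from
        corpus-level word co-occurrence (the TextGCN recipe); that construction
        is assumed here, not shown.
        """
        def __init__(self, vocab_size, dim, adj):
            super().__init__()
            self.base = nn.Embedding(vocab_size, dim)
            self.register_buffer("adj", adj)
            self.gcn = nn.Linear(dim, dim)

        def forward(self, token_ids):
            # One GCN layer: aggregate neighbors, transform, add residual.
            refined = torch.relu(self.gcn(self.adj @ self.base.weight))
            table = self.base.weight + refined      # graph-aware embedding table
            return table[token_ids]

    # The refined embeddings then drive an ordinary multi-layer LSTM decoder:
    vocab, dim = 1000, 256
    adj = torch.eye(vocab)                          # placeholder word graph
    emb = GCNRefinedEmbedding(vocab, dim, adj)
    decoder = nn.LSTM(dim, 512, num_layers=2, batch_first=True)
    tokens = torch.randint(0, vocab, (4, 12))       # a batch of caption prefixes
    hidden, _ = decoder(emb(tokens))                # (4, 12, 512)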
- Compressed Image Captioning using CNN-based Encoder-Decoder Framework
We develop an automatic image captioning architecture that combines the strengths of convolutional neural networks (CNNs) and encoder-decoder models.
We also present a performance comparison across several pre-trained CNN models.
To optimize the models further, we explore frequency regularization techniques to compress the "AlexNet" and "EfficientNetB0" models.
arXiv Detail & Related papers (2024-04-28T03:47:48Z)
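The backbone-comparison side of the entry above might look like the sketch below, reusing torchvision's pre-trained AlexNet and EfficientNet-B0 as frozen feature encoders for a captioning decoder. The frequency-regularization compression itself is not reproduced here, and the weights download on first run.

    import torch
    import torchvision.models as models

    def make_encoder(name):
        """Return a frozen pre-trained CNN that emits a pooled feature vector."""
        if name == "alexnet":
            net = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
            net.classifier = net.classifier[:-1]    # drop the class layer -> 4096-d
        elif name == "efficientnet_b0":
            net = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.DEFAULT)
            net.classifier = torch.nn.Identity()    # keep the pooled 1280-d feature
        else:
            raise ValueError(name)
        net.eval()
        for p in net.parameters():                  # frozen encoder
            p.requires_grad_(False)
        return net

    images = torch.randn(2, 3, 224, 224)
    for name in ("alexnet", "efficientnet_b0"):
        feats = make_encoder(name)(images)
        print(name, tuple(feats.shape))             # (2, 4096) and (2, 1280)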
- Learning text-to-video retrieval from image captioning
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
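The protocol in the entry above can be sketched roughly as below: caption sampled frames of each unlabeled video with an off-the-shelf image captioner, then treat the generated captions as supervision for contrastive text-to-video training. The frame sampling rate, the placeholder captioner, and the symmetric InfoNCE loss are assumptions of this sketch.

    import torch
    import torch.nn.functional as F

    def caption_image(frame):
        return "a person riding a bike"      # placeholder for a real image captioner

    def pseudo_label(video_frames):
        # One caption per sampled frame; the paper selects/filters among them.
        return [caption_image(f) for f in video_frames[::8]]

    def contrastive_loss(text_emb, video_emb, temperature=0.07):
        """Symmetric InfoNCE over a batch of (pseudo-caption, video) pairs."""
        text_emb = F.normalize(text_emb, dim=-1)
        video_emb = F.normalize(video_emb, dim=-1)
        logits = text_emb @ video_emb.t() / temperature
        targets = torch.arange(len(logits))
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

    # Usage with stand-in embeddings from text and video encoders:
    loss = contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))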
- CoBIT: A Contrastive Bi-directional Image-Text Generation Model
CoBIT employs a novel unicoder-decoder structure, which attempts to unify three pre-training objectives in one framework.
CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios.
arXiv Detail & Related papers (2023-03-23T17:24:31Z)
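Reading the three objectives off the title and abstract above (image-text contrastive, image-to-text generation, text-to-image generation), their combination might look like the following hedged sketch; the weighting, the temperature, and the discrete image tokens are assumptions, not CoBIT's published recipe.

    import torch
    import torch.nn.functional as F

    def cobit_style_loss(img_emb, txt_emb, cap_logits, cap_tokens,
                         img_tok_logits, img_tokens, w=(1.0, 1.0, 1.0)):
        """Weighted sum of three objectives: image-text contrastive alignment,
        image-to-text generation (captioning), and text-to-image generation
        over discrete image tokens."""
        i = F.normalize(img_emb, dim=-1)
        t = F.normalize(txt_emb, dim=-1)
        logits = i @ t.t() / 0.07
        labels = torch.arange(len(logits))
        contrastive = (F.cross_entropy(logits, labels) +
                       F.cross_entropy(logits.t(), labels)) / 2
        i2t = F.cross_entropy(cap_logits.flatten(0, 1), cap_tokens.flatten())
        t2i = F.cross_entropy(img_tok_logits.flatten(0, 1), img_tokens.flatten())
        return w[0] * contrastive + w[1] * i2t + w[2] * t2i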
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, such that the projected embedding retains the information of the visual input.
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
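The projection step above can be sketched as a similarity-weighted mixture over a memory of text embeddings, which is one way to land an image embedding in the text embedding space so a text-only-trained decoder can consume it. The support memory and the temperature are assumptions of this sketch.

    import torch
    import torch.nn.functional as F

    def project_to_text_space(image_emb, support_text_embs, temperature=0.01):
        """Project a CLIP image embedding into the CLIP text embedding space as
        a similarity-weighted mixture of a 'support memory' of text embeddings."""
        sims = (F.normalize(image_emb, dim=-1) @
                F.normalize(support_text_embs, dim=-1).t())
        weights = F.softmax(sims / temperature, dim=-1)
        return weights @ support_text_embs      # lies in the text embedding space

    img = torch.randn(512)                      # stand-in CLIP image embedding
    memory = torch.randn(10000, 512)            # stand-in text-corpus embeddings
    text_like = project_to_text_space(img, memory)   # decoded by the text decoder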
- Image Captioning based on Feature Refinement and Reflective Decoding
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each region of the image using Faster R-CNN with a ResNet-101 backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
arXiv Detail & Related papers (2022-06-16T07:56:28Z)
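A generic sketch of the core of an attention-based recurrent decoder, scoring Faster R-CNN region features against the hidden state at each step; this illustrates the attention mechanism only, not the paper's refinement or reflective modules, and the dimensions are assumptions.

    import torch
    import torch.nn as nn

    class RegionAttention(nn.Module):
        """Additive attention over region features: score every region against
        the decoder hidden state and return a weighted context vector."""
        def __init__(self, feat_dim, hid_dim, attn_dim=512):
            super().__init__()
            self.w_feat = nn.Linear(feat_dim, attn_dim)
            self.w_hid = nn.Linear(hid_dim, attn_dim)
            self.score = nn.Linear(attn_dim, 1)

        def forward(self, regions, hidden):
            # regions: (batch, n_regions, feat_dim); hidden: (batch, hid_dim)
            e = self.score(torch.tanh(self.w_feat(regions) +
                                      self.w_hid(hidden).unsqueeze(1))).squeeze(-1)
            alpha = torch.softmax(e, dim=-1)        # (batch, n_regions)
            return (alpha.unsqueeze(-1) * regions).sum(1), alpha

    regions = torch.randn(2, 36, 2048)              # 36 region features per image
    context, alpha = RegionAttention(2048, 512)(regions, torch.randn(2, 512))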
- MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining
We propose MaskOCR, a novel approach that unifies vision and language pre-training within the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
arXiv Detail & Related papers (2022-06-01T08:27:19Z)
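Masked image modeling on unlabeled text images, as named in the entry above, might look like the following sketch. The mask ratio, the zero mask token, and per-sample random masking are assumptions, not MaskOCR's exact recipe.

    import torch

    def random_patch_mask(patches, mask_ratio=0.6):
        """Hide a random subset of patch embeddings so the encoder must
        reconstruct them from context (no labels needed)."""
        b, n, d = patches.shape
        n_mask = int(n * mask_ratio)
        idx = torch.rand(b, n).argsort(dim=1)[:, :n_mask]  # random patches per sample
        mask = torch.zeros(b, n, dtype=torch.bool)
        mask.scatter_(1, idx, True)
        mask_token = torch.zeros(d)                 # a learned token in practice
        corrupted = torch.where(mask.unsqueeze(-1), mask_token, patches)
        return corrupted, mask                      # encoder input + target mask

    patches = torch.randn(2, 196, 768)              # stand-in patch embeddings
    corrupted, mask = random_patch_mask(patches)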
- Experimenting with Self-Supervision using Rotation Prediction for Image Captioning
Image captioning is a task in the field of Artificial Intelligence that merges computer vision and natural language processing.
We use an encoder-decoder architecture where the encoder is a convolutional neural network (CNN) trained on the OpenImages dataset.
We learn image features in a self-supervised fashion using the rotation pretext task.
arXiv Detail & Related papers (2021-07-28T00:46:27Z)
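The rotation pretext task from the entry above is simple to sketch: rotate each image by one of four angles and train the encoder to classify which rotation was applied, with no human labels needed. The `encoder` below is a stand-in.

    import torch
    import torch.nn.functional as F

    def rotation_pretext_batch(images):
        """Each image is rotated by 0/90/180/270 degrees; the label is which
        rotation was applied."""
        rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)])
        labels = torch.arange(4).repeat_interleave(len(images))
        return rotated, labels

    images = torch.randn(8, 3, 224, 224)
    x, y = rotation_pretext_batch(images)     # (32, 3, 224, 224), labels 0..3
    # loss = F.cross_entropy(encoder(x), y)   # 4-way rotation classification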
- TextCaps: a Dataset for Image Captioning with Reading Comprehension
Text is omnipresent in human environments and frequently critical to understanding our surroundings.
To study how to comprehend text in the context of an image, we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)
- Image to Language Understanding: Captioning approach
This project aims to compare different approaches for solving the image captioning problem.
In the encoder-decoder approach, inject and merge architectures were compared against a multi-modal image captioning approach.
Upon uploading an image, such a system outputs the best caption associated with it.
arXiv Detail & Related papers (2020-02-21T20:15:33Z)
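The inject/merge distinction compared in the entry above can be sketched as follows; the dimensions, GRU cells, and fusion-by-addition are assumptions of this sketch, not the project's exact models.

    import torch
    import torch.nn as nn

    class InjectCaptioner(nn.Module):
        """'Inject': image features enter the RNN itself, as its initial state."""
        def __init__(self, feat_dim, vocab, dim=256):
            super().__init__()
            self.init_h = nn.Linear(feat_dim, dim)
            self.emb = nn.Embedding(vocab, dim)
            self.rnn = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, vocab)

        def forward(self, img_feat, tokens):
            h0 = torch.tanh(self.init_h(img_feat)).unsqueeze(0)
            h, _ = self.rnn(self.emb(tokens), h0)
            return self.out(h)

    class MergeCaptioner(nn.Module):
        """'Merge': the RNN sees only text; image features are combined with
        its output just before word prediction."""
        def __init__(self, feat_dim, vocab, dim=256):
            super().__init__()
            self.emb = nn.Embedding(vocab, dim)
            self.rnn = nn.GRU(dim, dim, batch_first=True)
            self.img = nn.Linear(feat_dim, dim)
            self.out = nn.Linear(dim, vocab)

        def forward(self, img_feat, tokens):
            h, _ = self.rnn(self.emb(tokens))
            merged = h + self.img(img_feat).unsqueeze(1)   # late fusion per step
            return self.out(torch.tanh(merged))

    feat, toks = torch.randn(4, 2048), torch.randint(0, 1000, (4, 12))
    logits = InjectCaptioner(2048, 1000)(feat, toks)       # (4, 12, 1000)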
This list is automatically generated from the titles and abstracts of the papers on this site.