An Image captioning algorithm based on the Hybrid Deep Learning
Technique (CNN+GRU)
- URL: http://arxiv.org/abs/2301.02440v1
- Date: Fri, 6 Jan 2023 10:00:06 GMT
- Title: An Image captioning algorithm based on the Hybrid Deep Learning
Technique (CNN+GRU)
- Authors: Rana Adnan Ahmad, Muhammad Azhar, Hina Sattar
- Abstract summary: We present a CNN-GRU encoder-decoder framework with a caption-to-image reconstructor.
It takes the semantic context into account while also reducing time complexity.
The suggested model outperforms the state-of-the-art LSTM-A5 model for image captioning in terms of time complexity and accuracy.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image captioning by the encoder-decoder framework has shown tremendous
advancement in the last decade where CNN is mainly used as encoder and LSTM is
used as the decoder. Despite impressive accuracy on simple images, this pairing
is inefficient in both time and space complexity. Moreover, on complex images
containing many objects and a lot of information, the performance of the
CNN-LSTM pair degrades sharply because it lacks a semantic understanding of the
scenes presented in the images. To address these issues, we present a CNN-GRU
encoder-decoder framework with a caption-to-image reconstructor that takes the
semantic context into account while also reducing time complexity. From the
hidden states of the decoder, the input image and similar semantic
representations are reconstructed, and the reconstruction scores
from a semantic reconstructor are used in conjunction with likelihood during
model training to assess the quality of the generated caption. As a result, the
decoder receives improved semantic information, enhancing the caption
production process. During model testing, the reconstruction score and the
log-likelihood can also be combined to select the most appropriate caption.
The suggested model outperforms the state-of-the-art LSTM-A5 image-captioning
model in terms of both time complexity and accuracy.
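To make the scoring scheme concrete, below is a minimal PyTorch sketch, assuming toy layer sizes, a mean-pooled reconstruction target, and a hypothetical alpha weighting (none of these come from the paper): a CNN encodes the image, a GRU decodes the caption, and each candidate caption is scored by its log-likelihood plus a negative semantic reconstruction error.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNGRUCaptioner(nn.Module):
    """Hypothetical CNN-GRU captioner with a semantic reconstructor."""

    def __init__(self, vocab_size=1000, embed_dim=256, hidden_dim=512):
        super().__init__()
        # CNN encoder: image -> one feature vector (stand-in for a
        # pretrained backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)
        # Semantic reconstructor: decoder hidden states -> image feature.
        self.reconstructor = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, images, captions):
        img_feat = self.encoder(images)                  # (B, H)
        h0 = img_feat.unsqueeze(0)                       # initial GRU state
        # Teacher forcing: feed tokens 0..T-2, predict tokens 1..T-1.
        states, _ = self.gru(self.embed(captions[:, :-1]), h0)
        logits = self.head(states)                       # (B, T-1, V)
        # Reconstruct the image feature from the mean decoder state.
        recon = self.reconstructor(states.mean(dim=1))   # (B, H)
        return logits, recon, img_feat

def caption_score(logits, captions, recon, img_feat, alpha=0.1):
    """Log-likelihood plus negative reconstruction error; maximized during
    training and used at test time to rerank candidate captions."""
    log_lik = -F.cross_entropy(
        logits.transpose(1, 2), captions[:, 1:], reduction="none").sum(dim=1)
    recon_err = F.mse_loss(recon, img_feat, reduction="none").sum(dim=1)
    return log_lik - alpha * recon_err                   # higher is better

# Toy usage: score a candidate caption for each of two images.
model = CNNGRUCaptioner()
images = torch.randn(2, 3, 64, 64)
candidates = torch.randint(0, 1000, (2, 7))
logits, recon, img_feat = model(images, candidates)
print(caption_score(logits, candidates, recon, img_feat))

The same score can serve as the (negated) training loss and, at test time, as the criterion for picking the best candidate caption, mirroring the combination described above.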
Related papers
- $ε$-VAE: Denoising as Visual Decoding [61.29255979767292]
In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space.
Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input.
We propose denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder.
We evaluate our approach by assessing both reconstruction (rFID) and generation quality.
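As a loose sketch of that idea (a toy denoiser with a fixed blending schedule; nothing here is the paper's actual diffusion parameterization), the decoder below starts from pure noise and refines it over several steps, conditioned on the encoder's latent:

import torch
import torch.nn as nn

class IterativeDenoisingDecoder(nn.Module):
    """Toy 'denoising as decoding': start from noise and iteratively refine
    it into an image, conditioned on the encoder latent at every step."""

    def __init__(self, latent_dim=64, img_dim=3 * 32 * 32, steps=8):
        super().__init__()
        self.img_dim, self.steps = img_dim, steps
        self.denoiser = nn.Sequential(
            nn.Linear(img_dim + latent_dim, 512), nn.ReLU(),
            nn.Linear(512, img_dim),
        )

    def forward(self, latent):
        x = torch.randn(latent.size(0), self.img_dim)  # pure noise
        for t in range(self.steps):
            # Predict a cleaner image from the current estimate plus the
            # latent, then blend it in (a crude stand-in for a diffusion
            # update rule).
            pred = self.denoiser(torch.cat([x, latent], dim=1))
            x = x + (pred - x) / (self.steps - t)
        return x.view(-1, 3, 32, 32)

decoder = IterativeDenoisingDecoder()
z = torch.randn(4, 64)   # latents from some encoder
print(decoder(z).shape)  # torch.Size([4, 3, 32, 32])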
arXiv Detail & Related papers (2024-10-05T08:27:53Z)
- Compressed Image Captioning using CNN-based Encoder-Decoder Framework [0.0]
We develop an automatic image captioning architecture that combines the strengths of convolutional neural networks (CNNs) and encoder-decoder models.
We also perform a performance comparison across pre-trained CNN models.
For optimization, we also explore the integration of frequency regularization techniques to compress the "AlexNet" and "EfficientNetB0" models.
arXiv Detail & Related papers (2024-04-28T03:47:48Z)
- Corner-to-Center Long-range Context Model for Efficient Learned Image Compression [70.0411436929495]
In the framework of learned image compression, the context model plays a pivotal role in capturing the dependencies among latent representations.
We propose the Corner-to-Center transformer-based Context Model (C$^3$M) designed to enhance context and latent predictions.
In addition, to enlarge the receptive field in the analysis and synthesis transformation, we use the Long-range Crossing Attention Module (LCAM) in the encoder/decoder.
arXiv Detail & Related papers (2023-11-29T21:40:28Z)
- Dual-Stream Knowledge-Preserving Hashing for Unsupervised Video Retrieval [67.52910255064762]
We first design a simple dual-stream structure, including a temporal layer and a hash layer.
With the help of semantic similarity knowledge obtained from self-supervision, the hash layer learns to capture information for semantic retrieval.
In this way, the model naturally preserves the disentangled semantics into binary codes.
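A minimal sketch of such a dual-stream head, under assumed sizes and with the self-supervised semantic-similarity loss omitted:

import torch
import torch.nn as nn

class DualStreamHasher(nn.Module):
    """Toy dual-stream head: a temporal layer summarizes frame order and a
    hash layer maps the clip to a near-binary code."""

    def __init__(self, feat_dim=512, code_bits=64):
        super().__init__()
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.hash = nn.Linear(feat_dim, code_bits)

    def forward(self, frame_feats):                  # (B, T, D)
        _, h = self.temporal(frame_feats)            # clip summary (1, B, D)
        return torch.tanh(self.hash(h.squeeze(0)))   # relaxed binary codes

    @torch.no_grad()
    def binarize(self, frame_feats):
        return torch.sign(self(frame_feats))         # {-1, +1} retrieval codes

hasher = DualStreamHasher()
clips = torch.randn(2, 16, 512)       # 2 clips x 16 frames x 512-d features
print(hasher.binarize(clips).shape)   # torch.Size([2, 64])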
arXiv Detail & Related papers (2023-10-12T03:21:12Z)
- Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval [24.691270610091554]
In this paper, we aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts.
We obtain state-of-the-art performances on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
arXiv Detail & Related papers (2023-08-15T08:54:25Z)
- Image Captioning based on Feature Refinement and Reflective Decoding [0.0]
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each region in the image using the Faster R-CNN with ResNet-101 as a backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
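A rough sketch of a single attention-based decoding step over region features (the reflective attention module over past hidden states is omitted, and all sizes are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveDecoderStep(nn.Module):
    """Toy attention-based decoding step over region features; the
    reflective attention over past hidden states is not modeled here."""

    def __init__(self, feat_dim=2048, hid=512, vocab=10000):
        super().__init__()
        self.query = nn.Linear(hid, feat_dim)   # hidden state -> query
        self.cell = nn.GRUCell(feat_dim, hid)
        self.out = nn.Linear(hid, vocab)

    def forward(self, regions, h):               # regions: (B, R, F)
        # Attend over region features with the current hidden state.
        scores = torch.einsum("brf,bf->br", regions, self.query(h))
        ctx = (F.softmax(scores, dim=1).unsqueeze(2) * regions).sum(dim=1)
        h = self.cell(ctx, h)                    # recurrent update
        return self.out(h), h                    # next-token logits, state

step = AttentiveDecoderStep()
regions = torch.randn(2, 36, 2048)   # e.g. 36 Faster R-CNN regions per image
h = torch.zeros(2, 512)
logits, h = step(regions, h)
print(logits.shape)                  # torch.Size([2, 10000])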
arXiv Detail & Related papers (2022-06-16T07:56:28Z)
- End-to-End Transformer Based Model for Image Captioning [1.4303104706989949]
The Transformer-based model integrates image captioning into a single stage and realizes end-to-end training.
The model achieves new state-of-the-art performances of 138.2% (single model) and 141.0% (ensemble of 4 models).
arXiv Detail & Related papers (2022-03-29T08:47:46Z)
- Neural Data-Dependent Transform for Learned Image Compression [72.86505042102155]
We build a neural data-dependent transform and introduce a continuous online mode decision mechanism to jointly optimize the coding efficiency for each individual image.
The experimental results show the effectiveness of the proposed neural-syntax design and the continuous online mode decision mechanism.
arXiv Detail & Related papers (2022-03-09T14:56:48Z)
- Dynamic Neural Representational Decoders for High-Resolution Semantic Segmentation [98.05643473345474]
We propose a novel decoder, termed the dynamic neural representational decoder (NRD).
As each location on the encoder's output corresponds to a local patch of the semantic labels, in this work, we represent these local patches of labels with compact neural networks.
This neural representation enables our decoder to leverage the smoothness prior in the semantic label space, and thus makes our decoder more efficient.
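A toy rendition of that idea, assuming a single linear hypernetwork that emits the weights of each location's tiny coordinate-to-logits MLP:

import torch
import torch.nn as nn

class DynamicNeuralDecoder(nn.Module):
    """Toy dynamic decoder: each encoder location emits the weights of a
    tiny coordinate->logits MLP that renders its own KxK label patch."""

    def __init__(self, enc_dim=256, num_classes=21, patch=4, hid=8):
        super().__init__()
        self.patch, self.cls, self.hid = patch, num_classes, hid
        # Hypernetwork: one linear layer emits every parameter of the
        # per-location MLP (2 -> hid -> num_classes, biases included).
        n_params = (2 * hid + hid) + (hid * num_classes + num_classes)
        self.to_params = nn.Linear(enc_dim, n_params)
        # Fixed (x, y) coordinate grid for one patch, in [-1, 1].
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, patch), torch.linspace(-1, 1, patch),
            indexing="ij")
        self.register_buffer("coords", torch.stack([xs, ys], -1).view(-1, 2))

    def forward(self, enc_feats):                          # (B, C, H, W)
        B, _, H, W = enc_feats.shape
        h, c, k = self.hid, self.cls, self.patch
        p = self.to_params(enc_feats.permute(0, 2, 3, 1)).view(B * H * W, -1)
        w1 = p[:, : 2 * h].view(-1, 2, h)                  # per-location MLP
        b1 = p[:, 2 * h : 3 * h].view(-1, 1, h)
        w2 = p[:, 3 * h : 3 * h + h * c].view(-1, h, c)
        b2 = p[:, 3 * h + h * c :].view(-1, 1, c)
        # Render every patch with its own tiny MLP over the coordinate grid.
        x = torch.relu(self.coords.unsqueeze(0) @ w1 + b1)  # (BHW, K*K, hid)
        logits = x @ w2 + b2                                # (BHW, K*K, cls)
        logits = logits.view(B, H, W, k, k, c).permute(0, 5, 1, 3, 2, 4)
        return logits.reshape(B, c, H * k, W * k)

decoder = DynamicNeuralDecoder()
feats = torch.randn(2, 256, 16, 16)   # coarse encoder output
print(decoder(feats).shape)           # torch.Size([2, 21, 64, 64])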
arXiv Detail & Related papers (2021-07-30T04:50:56Z)
- A Novel Actor Dual-Critic Model for Remote Sensing Image Captioning [32.11006090613004]
We deal with the problem of generating textual captions from optical remote sensing (RS) images using the notion of deep reinforcement learning.
We introduce an Actor Dual-Critic training strategy where a second critic model is deployed in the form of an encoder-decoder RNN.
We observe that the proposed model generates sentences on the test data highly similar to the ground truth and is successful in generating even better captions in many critical cases.
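A toy policy-gradient step in this spirit, with untrained linear stand-ins for the actor and both critics and an assumed 50/50 reward weighting:

import torch
import torch.nn as nn

vocab_size, seq_len, batch = 100, 8, 4

# Actor: image feature -> per-step token logits (toy stand-in).
actor = nn.Linear(256, seq_len * vocab_size)
# Two critics scoring (image, caption) pairs; critic2 stands in for the
# paper's encoder-decoder RNN critic.
critic1 = nn.Linear(256 + seq_len, 1)
critic2 = nn.Linear(256 + seq_len, 1)
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

imgs = torch.randn(batch, 256)
logits = actor(imgs).view(batch, seq_len, vocab_size)
dist = torch.distributions.Categorical(logits=logits)
captions = dist.sample()                        # (B, T) sampled tokens
log_prob = dist.log_prob(captions).sum(dim=1)   # (B,)

with torch.no_grad():
    pair = torch.cat([imgs, captions.float()], dim=1)
    reward = 0.5 * critic1(pair) + 0.5 * critic2(pair)  # combined critics
    advantage = reward.squeeze(1) - reward.mean()       # simple baseline

loss = -(advantage * log_prob).mean()           # REINFORCE-style update
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))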
arXiv Detail & Related papers (2020-10-05T13:35:02Z)
- Modeling Lost Information in Lossy Image Compression [72.69327382643549]
Lossy image compression is one of the most commonly used operators for digital images.
We propose a novel invertible framework called Invertible Lossy Compression (ILC) to largely mitigate the information loss problem.
arXiv Detail & Related papers (2020-06-22T04:04:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.