Experimenting with Self-Supervision using Rotation Prediction for Image
Captioning
- URL: http://arxiv.org/abs/2107.13111v1
- Date: Wed, 28 Jul 2021 00:46:27 GMT
- Title: Experimenting with Self-Supervision using Rotation Prediction for Image
Captioning
- Authors: Ahmed Elhagry, Karima Kadaoui
- Abstract summary: Image captioning is a task in the field of Artificial Intelligence that merges computer vision and natural language processing.
We are using an encoder-decoder architecture where the encoder is a convolutional neural network (CNN) trained on the OpenImages dataset.
We learn image features in a self-supervised fashion using the rotation pretext task.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image captioning is a task in the field of Artificial Intelligence that
merges computer vision and natural language processing. It is
responsible for generating captions that describe images, and has various
applications like descriptions used by assistive technology or indexing images
(for search engines for instance). This makes it a crucial topic in AI that is
undergoing a lot of research. This task, however, like many others, is trained
on large sets of images labeled via human annotation, which can be very
cumbersome: it requires manual effort, incurs both financial and temporal
costs, is error-prone, and can be difficult to execute in some cases (e.g.
medical images). To
mitigate the need for labels, we attempt to use self-supervised learning, a
type of learning where models use the data contained within the images
themselves as labels. This is challenging to accomplish, though, since the task
is two-fold: the images and captions come from two different modalities and are
usually handled by different types of networks. It is thus not obvious what a
completely self-supervised solution would look like. How it could achieve
captioning in a way comparable to how self-supervision is applied today to
image recognition tasks is still an open research question. In this project, we
use an encoder-decoder architecture where the encoder is a convolutional
neural network (CNN) trained on the OpenImages dataset, learning image features
in a self-supervised fashion using the rotation pretext task. The decoder is a
Long Short-Term Memory (LSTM) network; it is trained, along with the rest of the
image captioning model, on the MS COCO dataset and is responsible for generating
captions.
Our GitHub repository can be found at:
https://github.com/elhagry1/SSL_ImageCaptioning_RotationPrediction
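To make the rotation pretext task concrete, the sketch below shows how a CNN encoder can be pretrained on unlabeled images by predicting which of four rotations (0, 90, 180 or 270 degrees) was applied to each input; the rotation index acts as a free label, so no human annotation is needed. The ResNet-18 backbone, optimizer and hyperparameters here are illustrative assumptions for the sketch, not the paper's exact configuration.

```python
# Minimal sketch of rotation-prediction pretraining (assumed setup: PyTorch,
# a ResNet-18 backbone and a 4-way rotation classifier; the paper's actual
# encoder and hyperparameters may differ).
import torch
import torch.nn as nn
import torchvision


def rotate_batch(images: torch.Tensor):
    """Rotate each image in a (B, C, H, W) batch by a random multiple of 90 degrees.

    The rotation index (0..3) is the self-supervised label: the data itself
    provides the target, with no human annotation involved. Assumes square
    crops (e.g. 224x224) so every rotation keeps the same tensor shape.
    """
    labels = torch.randint(0, 4, (images.size(0),))
    rotated = torch.stack(
        [torch.rot90(img, k=int(k), dims=(1, 2)) for img, k in zip(images, labels)]
    )
    return rotated, labels


# CNN encoder with a temporary 4-way head for the pretext task.
encoder = torchvision.models.resnet18(weights=None)
encoder.fc = nn.Linear(encoder.fc.in_features, 4)

optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()


def pretrain_step(unlabeled_images: torch.Tensor) -> float:
    """One self-supervised training step on a batch of unlabeled images."""
    rotated, rotation_labels = rotate_batch(unlabeled_images)
    logits = encoder(rotated)
    loss = criterion(logits, rotation_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After pretraining, the 4-way rotation head is discarded and the convolutional features serve as the image representation fed to the LSTM decoder, which is trained on MS COCO image-caption pairs as described in the abstract.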
Related papers
- Compressed Image Captioning using CNN-based Encoder-Decoder Framework [0.0]
We develop an automatic image captioning architecture that combines the strengths of convolutional neural networks (CNNs) and encoder-decoder models.
We also perform a performance comparison across pre-trained CNN models.
In our quest for optimization, we also explored the integration of frequency regularization techniques to compress the "AlexNet" and "EfficientNetB0" models.
arXiv Detail & Related papers (2024-04-28T03:47:48Z)
- DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training [73.74291217502928]
We propose a simple framework, named DeCap, for zero-shot captioning.
We introduce a lightweight visual-aware language decoder.
We project the visual embedding into the CLIP text embedding space, while the projected embedding retains the information of the visual input.
arXiv Detail & Related papers (2023-03-06T11:02:47Z)
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs [82.93345261434943]
Given an input image, and nothing else, our method returns the bounding boxes of objects in the image and phrases that describe the objects.
This is achieved within an open world paradigm, in which the objects in the input image may not have been encountered during the training of the localization mechanism.
Our work generalizes weakly supervised segmentation and phrase grounding and is shown empirically to outperform the state of the art in both domains.
arXiv Detail & Related papers (2022-06-19T09:07:30Z)
- Image Captioning based on Feature Refinement and Reflective Decoding [0.0]
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each region in the image using the Faster R-CNN with ResNet-101 as a backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
arXiv Detail & Related papers (2022-06-16T07:56:28Z)
- Neural Twins Talk & Alternative Calculations [3.198144010381572]
Inspired by how the human brain employs a higher number of neural pathways when describing a highly focused subject, we show that deep attentive models could be extended to achieve better performance.
Image captioning bridges a gap between computer vision and natural language processing.
arXiv Detail & Related papers (2021-08-05T18:41:34Z)
- Controlled Caption Generation for Images Through Adversarial Attacks [85.66266989600572]
We study adversarial examples for vision and language models, which typically adopt a Convolutional Neural Network (i.e., CNN) for image feature extraction and a Recurrent Neural Network (RNN) for caption generation.
In particular, we investigate attacks on the visual encoder's hidden layer that is fed to the subsequent recurrent network.
We propose a GAN-based algorithm for crafting adversarial examples for neural image captioning that mimics the internal representation of the CNN.
arXiv Detail & Related papers (2021-07-07T07:22:41Z)
- AugNet: End-to-End Unsupervised Visual Representation Learning with Image Augmentation [3.6790362352712873]
We propose AugNet, a new deep learning training paradigm to learn image features from a collection of unlabeled pictures.
Our experiments demonstrate that the method is able to represent the image in low dimensional space.
Unlike many deep-learning-based image retrieval algorithms, our approach does not require access to external annotated datasets.
arXiv Detail & Related papers (2021-06-11T09:02:30Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- Attention Beam: An Image Captioning Approach [33.939487457110566]
In recent times, encoder-decoder-based architectures have achieved state-of-the-art results for image captioning.
Here, we present a beam search on top of the encoder-decoder-based architecture that gives better-quality captions on three benchmark datasets (a minimal beam-search sketch is given after this list).
arXiv Detail & Related papers (2020-11-03T14:57:42Z)
- Self-Supervised Viewpoint Learning From Image Collections [116.56304441362994]
We propose a novel learning framework which incorporates an analysis-by-synthesis paradigm to reconstruct images in a viewpoint aware manner.
We show that our approach performs competitively with fully-supervised approaches for several object categories like human faces, cars, buses, and trains.
arXiv Detail & Related papers (2020-04-03T22:01:41Z)
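As noted in the Attention Beam entry above, beam-search decoding is a common way to improve caption quality on top of any encoder-decoder captioner, including CNN-LSTM models like the one in this paper. The sketch below is a minimal, framework-agnostic version; `decoder_step`, the token ids and the default widths are illustrative assumptions, not code from either paper.

```python
# Minimal beam-search decoding sketch for an encoder-decoder captioner.
# `decoder_step` is an assumed callable that maps a partial caption (token ids)
# to log-probabilities over the vocabulary for the next token.
from typing import Callable, List, Tuple


def beam_search(
    decoder_step: Callable[[List[int]], List[float]],
    bos_id: int,
    eos_id: int,
    beam_width: int = 5,
    max_len: int = 20,
) -> List[int]:
    """Keep the `beam_width` highest-scoring partial captions at each step."""
    # Each beam is (cumulative log-probability, token sequence).
    beams: List[Tuple[float, List[int]]] = [(0.0, [bos_id])]
    for _ in range(max_len):
        candidates: List[Tuple[float, List[int]]] = []
        for score, tokens in beams:
            if tokens[-1] == eos_id:          # finished captions are carried over
                candidates.append((score, tokens))
                continue
            log_probs = decoder_step(tokens)  # one decoder step on this prefix
            for token_id, log_p in enumerate(log_probs):
                candidates.append((score + log_p, tokens + [token_id]))
        # Prune to the best `beam_width` hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == eos_id for _, tokens in beams):
            break
    return max(beams, key=lambda c: c[0])[1]
```

Greedy decoding is the special case beam_width = 1; larger widths trade extra decoding cost for captions with higher overall likelihood under the model.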