Unpaired Image Captioning by Image-level Weakly-Supervised Visual
Concept Recognition
- URL: http://arxiv.org/abs/2203.03195v1
- Date: Mon, 7 Mar 2022 08:02:23 GMT
- Title: Unpaired Image Captioning by Image-level Weakly-Supervised Visual
Concept Recognition
- Authors: Peipei Zhu, Xiao Wang, Yong Luo, Zhenglong Sun, Wei-Shi Zheng, Yaowei
Wang, and Changwen Chen
- Abstract summary: Unpaired image captioning (UIC) is to describe images without using image-caption pairs in the training phase.
Most existing studies use off-the-shelf algorithms to obtain the visual concepts.
We propose a novel approach to achieve cost-effective UIC using image-level labels.
- Score: 83.93422034664184
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The goal of unpaired image captioning (UIC) is to describe images without
using image-caption pairs in the training phase. Although challenging, we
expect the task can be accomplished by leveraging a training set of images
aligned with visual concepts. Most existing studies use off-the-shelf
algorithms to obtain the visual concepts because the Bounding Box (BBox) labels
or relationship-triplet labels used for the training are expensive to acquire.
To avoid such expensive annotations, we propose a novel
approach to achieve cost-effective UIC. Specifically, we adopt image-level
labels for the optimization of the UIC model in a weakly-supervised manner. For
each image, we assume that only the image-level labels are available without
specific locations and numbers. The image-level labels are utilized to train a
weakly-supervised object recognition model to extract object information (e.g.,
instances) in an image, and the extracted instances are used to infer the
relationships among different objects based on an enhanced graph neural network
(GNN). The proposed approach achieves performance comparable to, or even better
than, that of previous methods, without the expensive annotation cost.
Furthermore, we design an unrecognized object (UnO) loss combined with a visual
concept reward to improve the alignment of the inferred object and relationship
information with the images. This effectively alleviates a common issue in
existing UIC models: generating sentences that mention nonexistent objects. To
the best of our knowledge, this is the first attempt to solve the problem of
Weakly-Supervised visual concept recognition for UIC (WS-UIC) based only on
image-level labels. Extensive experiments have been carried out to demonstrate
that the proposed WS-UIC model achieves promising results on the COCO dataset
while significantly reducing the cost of labeling.
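The abstract names three moving parts: a concept recognizer trained only on image-level labels, a GNN that infers relationships among the recognized instances, and an unrecognized-object (UnO) loss that discourages captions mentioning objects the recognizer did not find. The PyTorch sketch below shows one plausible shape for each part; every module name, backbone choice, threshold, and loss form here is an illustrative assumption made for this summary, not the authors' implementation.

```python
# Minimal sketch of the WS-UIC pipeline as described in the abstract.
# All names, shapes, and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class ConceptRecognizer(nn.Module):
    """Image-level weakly-supervised concept recognition: a multi-label
    classifier whose only supervision is which concepts appear in the
    image (no boxes, counts, or locations)."""

    def __init__(self, num_concepts: int):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()               # keep the 512-d pooled feature
        self.backbone = backbone
        self.head = nn.Linear(512, num_concepts)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(images))   # (B, num_concepts) logits


def recognition_loss(logits, image_level_labels):
    # Standard multi-label objective for image-level supervision.
    return F.binary_cross_entropy_with_logits(logits, image_level_labels)


class RelationScorer(nn.Module):
    """One round of message passing over recognized-concept embeddings,
    then pairwise relation scores -- a simple stand-in for the paper's
    'enhanced GNN' relation inference."""

    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)
        self.rel = nn.Linear(2 * dim, 1)

    @staticmethod
    def _pairs(nodes):
        n = nodes.size(0)
        a = nodes.unsqueeze(1).expand(n, n, -1)   # pairs[i, j] = (node_i,
        b = nodes.unsqueeze(0).expand(n, n, -1)   #                node_j)
        return torch.cat([a, b], dim=-1)

    def forward(self, nodes):                     # nodes: (N, dim)
        nodes = nodes + F.relu(self.msg(self._pairs(nodes))).mean(dim=1)
        return torch.sigmoid(self.rel(self._pairs(nodes))).squeeze(-1)  # (N, N)


def uno_penalty(word_concept_probs, recognized):
    """A guessed form of the UnO loss: penalize the probability mass the
    captioner places on object words whose concepts were not recognized.

    word_concept_probs: (B, T, C) per-step probability of emitting each
        object-concept word; recognized: (B, C) binary detection mask.
    """
    leaked = word_concept_probs * (1.0 - recognized).unsqueeze(1)
    return leaked.sum(dim=(1, 2)).mean()


# Toy end-to-end pass with random tensors, just to show the shapes.
B, C, D = 4, 80, 512
model = ConceptRecognizer(num_concepts=C)
images = torch.randn(B, 3, 224, 224)
labels = torch.randint(0, 2, (B, C)).float()      # image-level labels only
logits = model(images)
loss = recognition_loss(logits, labels)
recognized = (torch.sigmoid(logits) > 0.5).float()
relations = RelationScorer(D)(torch.randn(5, D))  # e.g., 5 recognized instances
print(loss.item(), recognized.shape, relations.shape)
```

In the full system the captioner would presumably be trained on an external text corpus, with the UnO penalty and the visual concept reward combined (e.g., via a policy-gradient objective) to align generated sentences with the recognized concepts and relations.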
Related papers
- Few-shot Class-Incremental Semantic Segmentation via Pseudo-Labeling and
Knowledge Distillation [3.4436201325139737]
We address the problem of learning new classes for semantic segmentation models from few examples.
For learning from limited data, we propose a pseudo-labeling strategy to augment the few-shot training annotations.
We integrate the above steps into a single convolutional neural network with a unified learning objective.
arXiv Detail & Related papers (2023-08-05T05:05:37Z) - Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z) - Semantic Contrastive Bootstrapping for Single-positive Multi-label
Recognition [36.3636416735057]
We present a semantic contrastive bootstrapping (Scob) approach to gradually recover the cross-object relationships.
We then propose a recurrent semantic masked transformer to extract iconic object-level representations.
Extensive experimental results demonstrate that the proposed joint learning framework surpasses the state-of-the-art models.
arXiv Detail & Related papers (2023-07-15T01:59:53Z) - Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments (a minimal prompt-pairing sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-07-25T17:58:16Z) - AugNet: End-to-End Unsupervised Visual Representation Learning with
Image Augmentation [3.6790362352712873]
We propose AugNet, a new deep learning training paradigm to learn image features from a collection of unlabeled pictures.
Our experiments demonstrate that the method is able to represent the image in low dimensional space.
Unlike many deep-learning-based image retrieval algorithms, our approach does not require access to external annotated datasets.
arXiv Detail & Related papers (2021-06-11T09:02:30Z) - Learning to Focus: Cascaded Feature Matching Network for Few-shot Image
Recognition [38.49419948988415]
Deep networks can learn to accurately recognize objects of a category by training on a large number of images.
A meta-learning challenge known as low-shot image recognition arises when only a few annotated images are available for learning a recognition model for a category.
Our method, called Cascaded Feature Matching Network (CFMN), is proposed to solve this problem.
Experiments for few-shot learning on two standard datasets, miniImageNet and Omniglot, have confirmed the effectiveness of our method.
arXiv Detail & Related papers (2021-01-13T11:37:28Z) - Gradient-Induced Co-Saliency Detection [81.54194063218216]
Co-saliency detection (Co-SOD) aims to segment the common salient foreground in a group of relevant images.
In this paper, inspired by human behavior, we propose a gradient-induced co-saliency detection method.
arXiv Detail & Related papers (2020-04-28T08:40:55Z) - Distilling Localization for Self-Supervised Representation Learning [82.79808902674282]
Contrastive learning has revolutionized unsupervised representation learning.
Current contrastive models are ineffective at localizing the foreground object.
We propose a data-driven approach for learning invariance to backgrounds.
arXiv Detail & Related papers (2020-04-14T16:29:42Z) - Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
arXiv Detail & Related papers (2020-02-27T16:45:25Z)
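Regarding the CLIP entry above: the zero-shot "look and feel" probing it describes can be approximated with an off-the-shelf CLIP model and antonym prompt pairs. The sketch below uses the HuggingFace transformers CLIP API; the specific prompts and the two-way softmax are illustrative guesses, not the paper's exact protocol.

```python
# Zero-shot image "look" (quality) and "feel" (abstract perception) probing
# with CLIP. Prompt pairs are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, pos: str, neg: str) -> float:
    """Softmax over a positive/negative prompt pair; returns P(pos)."""
    inputs = processor(text=[pos, neg], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # (1 image, 2 prompts)
    return logits.softmax(dim=-1)[0, 0].item()

img = Image.open("photo.jpg")                       # any local test image
quality = clip_score(img, "a good photo", "a bad photo")      # "look"
feel = clip_score(img, "a happy photo", "a gloomy photo")     # "feel"
print(f"quality={quality:.3f}, feel={feel:.3f}")
```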
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.