ImaginaryNet: Learning Object Detectors without Real Images and
Annotations
- URL: http://arxiv.org/abs/2210.06886v1
- Date: Thu, 13 Oct 2022 10:25:22 GMT
- Title: ImaginaryNet: Learning Object Detectors without Real Images and
Annotations
- Authors: Minheng Ni, Zitong Huang, Kailai Feng, Wangmeng Zuo
- Abstract summary: We propose a framework to synthesize images by combining a pretrained language model and a text-to-image model.
With the synthesized images and class labels, weakly supervised object detection can then be leveraged to accomplish Imaginary-Supervised Object Detection.
Experiments show that ImaginaryNet can reach about 70% of the performance of a weakly supervised counterpart with the same backbone trained on real data.
- Score: 66.30908705345973
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Without any training on real examples, humans can easily detect a
known concept based solely on its language description. Endowing deep learning
with this ability would enable neural networks to handle complex vision tasks,
e.g., object detection, without collecting or annotating real images. To this
end, this paper introduces a novel and challenging learning paradigm,
Imaginary-Supervised Object Detection (ISOD), in which neither real images nor
manual annotations may be used to train object detectors. To address this
challenge, we propose ImaginaryNet, a framework that synthesizes images by
combining a pretrained language model with a text-to-image synthesis model.
Given a class label, the language model generates a full description of a
scene containing the target object, and the text-to-image model then renders a
photo-realistic image from that description. With the synthesized images and
their class labels, weakly supervised object detection can be leveraged to
accomplish ISOD. By gradually introducing real images and manual annotations,
ImaginaryNet can also collaborate with other supervision settings to further
boost detection performance. Experiments show that ImaginaryNet can (i) reach
about 70% of the performance of a weakly supervised counterpart with the same
backbone trained on real data, and (ii) significantly improve over the
baseline while achieving state-of-the-art or comparable performance when
combined with other supervision settings.
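The abstract describes a two-stage synthesis pipeline: a language model expands
a bare class label into a full scene description, and a text-to-image model
renders that description into a photo-realistic training image. Below is a
minimal sketch of that pipeline; the specific checkpoints (GPT-2 and Stable
Diffusion) and the Hugging Face transformers/diffusers APIs are illustrative
stand-ins, since the listing does not name the models ImaginaryNet actually
uses.

```python
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline

# Hypothetical label set; the paper's actual class vocabulary is not
# given in this listing.
CLASSES = ["dog", "bicycle", "boat"]

# Stage 1: expand a bare class label into a full scene description.
lm = pipeline("text-generation", model="gpt2")

def describe(label: str) -> str:
    prompt = f"A photo of a {label}"
    out = lm(prompt, max_new_tokens=25, num_return_sequences=1)
    return out[0]["generated_text"]

# Stage 2: render a photo-realistic image from the description.
# Stable Diffusion is a stand-in text-to-image model for illustration.
t2i = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

dataset = []
for label in CLASSES:
    caption = describe(label)
    image = t2i(caption).images[0]
    # Each image inherits its class label from the prompt, so no boxes
    # are needed; a weakly supervised detector trains on (image, label).
    dataset.append((image, label))
```

The resulting (image, label) pairs carry only image-level supervision, which is
exactly the input a weakly supervised object detector expects.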
Related papers
- Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- PUG: Photorealistic and Semantically Controllable Synthetic Data for
Representation Learning [31.81199165450692]
We present a new generation of interactive environments for representation learning research that offer both controllability and realism.
We use the Unreal Engine, a powerful game engine well known in the entertainment industry, to produce PUG environments and datasets for representation learning.
arXiv Detail & Related papers (2023-08-08T01:33:13Z)
- Seeing in Words: Learning to Classify through Language Bottlenecks [59.97827889540685]
Humans can explain their predictions using succinct and intuitive descriptions.
We show that a vision model whose feature representations are text can effectively classify ImageNet images.
arXiv Detail & Related papers (2023-06-29T00:24:42Z)
- Txt2Img-MHN: Remote Sensing Image Generation from Text Using Modern
Hopfield Networks [20.856451960761948]
We propose a novel text-to-image modern Hopfield network (Txt2Img-MHN) to generate realistic remote sensing images.
To better evaluate the realism and semantic consistency of the generated images, we conduct zero-shot classification on real remote sensing data.
Experiments on the benchmark remote sensing text-image dataset demonstrate that the proposed Txt2Img-MHN can generate more realistic remote sensing images.
arXiv Detail & Related papers (2022-08-08T22:02:10Z)
- De-rendering 3D Objects in the Wild [21.16153549406485]
We present a weakly supervised method that is able to decompose a single image of an object into shape, material, and lighting parameters.
For training, the method only relies on a rough initial shape estimate of the training objects to bootstrap the learning process.
In our experiments, we show that the method can successfully de-render 2D images into a 3D representation and generalizes to unseen object categories.
arXiv Detail & Related papers (2022-01-06T23:50:09Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
- Unsupervised Object-Level Representation Learning from Scene Images [97.07686358706397]
Object-level Representation Learning (ORL) is a new self-supervised learning framework for scene images.
Our key insight is to leverage image-level self-supervised pre-training as the prior to discover object-level semantic correspondence.
ORL significantly improves the performance of self-supervised learning on scene images, even surpassing supervised ImageNet pre-training on several downstream tasks.
arXiv Detail & Related papers (2021-06-22T17:51:24Z)
- CONFIG: Controllable Neural Face Image Generation [10.443563719622645]
ConfigNet is a neural face model that allows for controlling individual aspects of output images in meaningful ways.
Our novel method uses synthetic data to factorize the latent space into elements that correspond to the inputs of a traditional rendering pipeline.
arXiv Detail & Related papers (2020-05-06T09:19:46Z)
- Self-Supervised Viewpoint Learning From Image Collections [116.56304441362994]
We propose a novel learning framework which incorporates an analysis-by-synthesis paradigm to reconstruct images in a viewpoint-aware manner.
We show that our approach performs competitively to fully-supervised approaches for several object categories like human faces, cars, buses, and trains.
arXiv Detail & Related papers (2020-04-03T22:01:41Z)