Learning Object Detection from Captions via Textual Scene Attributes
- URL: http://arxiv.org/abs/2009.14558v1
- Date: Wed, 30 Sep 2020 10:59:20 GMT
- Title: Learning Object Detection from Captions via Textual Scene Attributes
- Authors: Achiya Jerbi, Roei Herzig, Jonathan Berant, Gal Chechik, Amir
Globerson
- Abstract summary: We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object detection is a fundamental task in computer vision, requiring large
annotated datasets that are difficult to collect, as annotators need to label
objects and their bounding boxes. Thus, it is a significant challenge to use
cheaper forms of supervision effectively. Recent work has begun to explore
image captions as a source for weak supervision, but to date, in the context of
object detection, captions have only been used to infer the categories of the
objects in the image. In this work, we argue that captions contain much richer
information about the image, including attributes of objects and their
relations. Namely, the text represents a scene of the image, as described
recently in the literature. We present a method that uses the attributes in
this "textual scene graph" to train object detectors. We empirically
demonstrate that the resulting model achieves state-of-the-art results on
several challenging object detection datasets, outperforming recent approaches.
Related papers
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z)
- PEEKABOO: Hiding parts of an image for unsupervised object localization [7.161489957025654]
Localizing objects in an unsupervised manner poses significant challenges due to the absence of key visual information.
We propose a single-stage learning framework, dubbed PEEKABOO, for unsupervised object localization.
The key idea is to selectively hide parts of an image and leverage the remaining image information to infer the location of objects without explicit supervision.
arXiv Detail & Related papers (2024-07-24T20:35:20Z)
- Salient Object Detection for Images Taken by People With Vision Impairments [13.157939981657886]
We introduce a new salient object detection dataset using images taken by people who are visually impaired.
VizWiz-SalientObject is the largest (i.e., 32,000 human-annotated images) and contains unique characteristics.
We benchmarked seven modern salient object detection methods on our dataset and found they struggle most with images featuring salient objects that are large, have less complex boundaries, and lack text.
arXiv Detail & Related papers (2023-01-12T22:33:01Z)
- Learning Object-Language Alignments for Open-Vocabulary Object Detection [83.09560814244524]
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
- Exploiting Unlabeled Data with Vision and Language Models for Object Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
arXiv Detail & Related papers (2022-07-18T21:47:15Z)
- Automatic dataset generation for specific object detection [6.346581421948067]
We present a method to synthesize object-in-scene images, which can preserve the objects' detailed features without bringing irrelevant information.
Our result shows that in the synthesized image, the boundaries of objects blend very well with the background.
arXiv Detail & Related papers (2022-07-16T07:44:33Z)
- DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection [18.276823176045525]
We propose a new paradigm for automatic context image generation at scale.
At the core of our approach lies utilizing an interplay between language description of context and language-driven image generation.
We demonstrate the advantages of our approach over the prior context image generation approaches on four object detection datasets.
arXiv Detail & Related papers (2022-06-20T06:43:17Z)
- Context-Aware Transfer Attacks for Object Detection [51.65308857232767]
We present a new approach to generate context-aware attacks for object detectors.
We show that by using co-occurrence of objects and their relative locations and sizes as context information, we can successfully generate targeted mis-categorization attacks.
arXiv Detail & Related papers (2021-12-06T18:26:39Z)
- A Simple and Effective Use of Object-Centric Images for Long-Tailed Object Detection [56.82077636126353]
We take advantage of object-centric images to improve object detection in scene-centric images.
We present a simple yet surprisingly effective framework to do so.
Our approach can improve the object detection (and instance segmentation) accuracy of rare objects by a relative 50% (and 33%).
arXiv Detail & Related papers (2021-02-17T17:27:21Z)
- Cross-Supervised Object Detection [42.783400918552765]
We show how to build better object detectors from weakly labeled images of new categories by leveraging knowledge learned from fully labeled base categories.
We propose a unified framework that combines a detection head trained from instance-level annotations and a recognition head learned from image-level annotations.
arXiv Detail & Related papers (2020-06-26T15:33:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.