Read, look and detect: Bounding box annotation from image-caption pairs
- URL: http://arxiv.org/abs/2306.06149v1
- Date: Fri, 9 Jun 2023 12:23:20 GMT
- Title: Read, look and detect: Bounding box annotation from image-caption pairs
- Authors: Eduardo Hugo Sanchez
- Abstract summary: We propose a method to locate and label objects in an image by using a form of weaker supervision: image-caption pairs.
Our experiments demonstrate the effectiveness of our approach by achieving a 47.51% recall@1 score in phrase grounding on Flickr30k Entities.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Various methods have been proposed to detect objects while reducing the cost
of data annotation. For instance, weakly supervised object detection (WSOD)
methods rely only on image-level annotations during training. Unfortunately,
data annotation remains expensive since annotators must provide the categories
describing the content of each image and labeling is restricted to a fixed set
of categories. In this paper, we propose a method to locate and label objects
in an image by using a form of weaker supervision: image-caption pairs. By
leveraging recent advances in vision-language (VL) models and self-supervised
vision transformers (ViTs), our method is able to perform phrase grounding and
object detection in a weakly supervised manner. Our experiments demonstrate the
effectiveness of our approach by achieving a 47.51% recall@1 score in phrase
grounding on Flickr30k Entities and establishing a new state-of-the-art in
object detection by achieving 21.1 mAP 50 and 10.5 mAP 50:95 on MS COCO when
exclusively relying on image-caption pairs.
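The recall@1 metric reported above counts a phrase as correctly grounded when the model's top-ranked box overlaps the ground-truth box with IoU of at least 0.5. A minimal sketch of that computation (a hypothetical illustration, not the paper's evaluation code; boxes are assumed to be `(x1, y1, x2, y2)` tuples):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def recall_at_1(top1_preds, ground_truths, thresh=0.5):
    """Fraction of phrases whose top-1 predicted box reaches the IoU threshold
    against the ground-truth box for that phrase."""
    hits = sum(1 for pred, gt in zip(top1_preds, ground_truths)
               if iou(pred, gt) >= thresh)
    return hits / len(top1_preds)

# Example: two phrases, only the first grounded within IoU 0.5.
preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
gts = [(12, 12, 48, 52), (60, 60, 90, 90)]
print(recall_at_1(preds, gts))  # 0.5
```

The same IoU routine underlies the mAP 50 and mAP 50:95 detection metrics, which additionally sweep the IoU threshold and average precision over recall levels.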
Related papers
- Search and Detect: Training-Free Long Tail Object Detection via Web-Image Retrieval
We introduce SearchDet, a training-free long-tail object detection framework.
Our proposed method is simple and training-free, yet achieves over 48.7% mAP improvement on ODinW and 59.1% mAP improvement on LVIS.
arXiv Detail & Related papers (2024-09-26T05:14:19Z)
- Learning Object-Language Alignments for Open-Vocabulary Object Detection
We propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
It enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way.
arXiv Detail & Related papers (2022-11-27T14:47:31Z)
- Weakly-Supervised Camouflaged Object Detection with Scribble Annotations
We propose the first weakly-supervised camouflaged object detection (COD) method, using scribble annotations as supervision.
Annotating camouflaged objects pixel-wise takes about 60 minutes per image.
We propose a novel consistency loss composed of two parts: a reliable cross-view loss to attain reliable consistency over different images, and a soft inside-view loss to maintain consistency inside a single prediction map.
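The two-part loss described above can be sketched roughly as follows. This is a hypothetical illustration under simplifying assumptions, not that paper's implementation: the cross-view term is written here as a mean-squared disagreement between prediction maps, and the soft inside-view term as a binary-entropy penalty that nudges each pixel of a single prediction map toward a confident 0 or 1.

```python
import numpy as np

def cross_view_loss(pred_a, pred_b):
    # Hypothetical cross-view term: penalize disagreement between
    # two prediction maps that should be consistent.
    return float(np.mean((pred_a - pred_b) ** 2))

def soft_inside_view_loss(pred):
    # Hypothetical soft inside-view term: binary entropy, lowest when
    # each pixel of the prediction map commits to 0 or 1.
    eps = 1e-8
    p = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-p * np.log(p) - (1 - p) * np.log(1 - p)))

def consistency_loss(pred_a, pred_b, alpha=1.0):
    # Combined loss: cross-view consistency plus a weighted
    # inside-view confidence penalty on each map.
    return cross_view_loss(pred_a, pred_b) + alpha * 0.5 * (
        soft_inside_view_loss(pred_a) + soft_inside_view_loss(pred_b))
```

Identical, confident prediction maps yield a near-zero loss; diverging or uncertain maps are penalized.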
arXiv Detail & Related papers (2022-07-28T13:40:07Z)
- Unpaired Image Captioning by Image-level Weakly-Supervised Visual Concept Recognition
Unpaired image captioning (UIC) aims to describe images without using image-caption pairs during training.
Most existing studies use off-the-shelf algorithms to obtain the visual concepts.
We propose a novel approach to achieve cost-effective UIC using image-level labels.
arXiv Detail & Related papers (2022-03-07T08:02:23Z)
- Is Object Detection Necessary for Human-Object Interaction Recognition?
This paper revisits human-object interaction (HOI) recognition at the image level without using supervision of object location or human pose.
We name it detection-free HOI recognition, in contrast to the existing detection-supervised approaches.
arXiv Detail & Related papers (2021-07-27T21:15:00Z)
- Data Augmentation for Object Detection via Differentiable Neural Rendering
It is challenging to train a robust object detector when annotated data is scarce.
Existing approaches to this problem include semi-supervised learning, which interpolates labeled data from unlabeled data.
We introduce an offline data augmentation method for object detection, which semantically interpolates the training data with novel views.
arXiv Detail & Related papers (2021-03-04T06:31:06Z)
- Instance Localization for Self-supervised Detection Pretraining
We propose a new self-supervised pretext task, called instance localization.
We show that integration of bounding boxes into pretraining promotes better task alignment and architecture alignment for transfer learning.
Experimental results demonstrate that our approach yields state-of-the-art transfer learning results for object detection.
arXiv Detail & Related papers (2021-02-16T17:58:57Z)
- Learning Object Detection from Captions via Textual Scene Attributes
We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z)
- Cross-Supervised Object Detection
We show how to build better object detectors from weakly labeled images of new categories by leveraging knowledge learned from fully labeled base categories.
We propose a unified framework that combines a detection head trained from instance-level annotations and a recognition head learned from image-level annotations.
arXiv Detail & Related papers (2020-06-26T15:33:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.