What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs
- URL: http://arxiv.org/abs/2206.09358v1
- Date: Sun, 19 Jun 2022 09:07:30 GMT
- Title: What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs
- Authors: Tal Shaharabany, Yoad Tewel and Lior Wolf
- Abstract summary: Given an input image, and nothing else, our method returns the bounding boxes of objects in the image and phrases that describe the objects.
This is achieved within an open world paradigm, in which the objects in the input image may not have been encountered during the training of the localization mechanism.
Our work generalizes weakly supervised segmentation and phrase grounding and is shown empirically to outperform the state of the art in both domains.
- Score: 82.93345261434943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given an input image, and nothing else, our method returns the bounding boxes
of objects in the image and phrases that describe the objects. This is achieved
within an open world paradigm, in which the objects in the input image may not
have been encountered during the training of the localization mechanism.
Moreover, training takes place in a weakly supervised setting, where no
bounding boxes are provided. To achieve this, our method combines two
pre-trained networks: the CLIP image-to-text matching score and the BLIP image
captioning tool. Training takes place on COCO images and their captions and is
based on CLIP. Then, during inference, BLIP is used to generate a hypothesis
regarding various regions of the current image. Our work generalizes weakly
supervised segmentation and phrase grounding and is shown empirically to
outperform the state of the art in both domains. It also shows very convincing
results in the novel task of weakly-supervised open-world purely visual
phrase-grounding presented in our work. For example, on the datasets used for
benchmarking phrase-grounding, our method results in a very modest degradation
in comparison to methods that employ human captions as an additional input. Our
code is available at https://github.com/talshaharabany/what-is-where-by-looking
and a live demo can be found at
https://talshaharabany/what-is-where-by-looking.
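To make the inference stage concrete, below is a minimal Python sketch of the idea: BLIP generates a phrase hypothesis for each candidate region, and CLIP's image-to-text matching score ranks how well each phrase fits its crop. The Hugging Face checkpoints, the hard-coded box proposals, and the simple ranking loop are illustrative assumptions, not the authors' exact pipeline, which additionally trains a CLIP-based localization network on COCO captions.

```python
# Minimal sketch of the caption-then-score idea: BLIP proposes a phrase per
# region crop, CLIP scores how well that phrase matches the crop. Checkpoints,
# box proposals and ranking are illustrative assumptions, not the paper's
# exact pipeline.
import torch
from PIL import Image
from transformers import (
    BlipProcessor, BlipForConditionalGeneration,
    CLIPProcessor, CLIPModel,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)


def caption_region(image: Image.Image, box) -> str:
    """Generate a BLIP caption (phrase hypothesis) for one region crop."""
    crop = image.crop(box)  # box = (left, upper, right, lower)
    inputs = blip_proc(images=crop, return_tensors="pt").to(device)
    out = blip.generate(**inputs, max_new_tokens=20)
    return blip_proc.decode(out[0], skip_special_tokens=True)


def clip_score(image: Image.Image, box, phrase: str) -> float:
    """CLIP image-to-text matching score between a region crop and its phrase."""
    crop = image.crop(box)
    inputs = clip_proc(text=[phrase], images=crop,
                       return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        out = clip(**inputs)
    return out.logits_per_image.item()


# Hypothetical usage: caption a few candidate boxes and keep the
# best-matching (box, phrase) pairs.
image = Image.open("example.jpg").convert("RGB")
boxes = [(0, 0, 200, 200), (100, 50, 400, 300)]  # placeholder proposals
hypotheses = [(b, caption_region(image, b)) for b in boxes]
ranked = sorted(((clip_score(image, b, p), b, p) for b, p in hypotheses),
                reverse=True)
for score, box, phrase in ranked:
    print(f"{score:6.2f}  {box}  ->  {phrase}")
```

In the paper's full pipeline, candidate regions come from the trained localization mechanism rather than from fixed boxes; this sketch only illustrates the interplay between the BLIP captioner and the CLIP matching score.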
Related papers
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z)
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z)
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
arXiv Detail & Related papers (2023-08-16T12:39:39Z)
- Neural Implicit Vision-Language Feature Fields [40.248658511361015]
We present a zero-shot volumetric open-vocabulary semantic scene segmentation method.
Our method builds on the insight that we can fuse image features from a vision-language model into a neural implicit representation.
We show that our method works on noisy real-world data and can run in real-time on live sensor data dynamically adjusting to text prompts.
arXiv Detail & Related papers (2023-03-20T09:38:09Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Adapting CLIP For Phrase Localization Without Further Training [30.467802103692378]
We propose to leverage the contrastive language-vision model CLIP, pre-trained on image and caption pairs.
We adapt CLIP to generate high-resolution spatial feature maps.
Our method for phrase localization requires no human annotations or additional training.
arXiv Detail & Related papers (2022-04-07T17:59:38Z)
- RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on the COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
- Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
arXiv Detail & Related papers (2020-02-27T16:45:25Z)