Detector-Free Weakly Supervised Grounding by Separation
- URL: http://arxiv.org/abs/2104.09829v1
- Date: Tue, 20 Apr 2021 08:27:31 GMT
- Title: Detector-Free Weakly Supervised Grounding by Separation
- Authors: Assaf Arbelle, Sivan Doveh, Amit Alfassy, Joseph Shtok, Guy Lev, Eli
Schwartz, Hilde Kuehne, Hila Barak Levi, Prasanna Sattigeri, Rameswar Panda,
Chun-Fu Chen, Alex Bronstein, Kate Saenko, Shimon Ullman, Raja Giryes,
Rogerio Feris, Leonid Karlinsky
- Abstract summary: Weakly Supervised phrase-Grounding (WSG) deals with the task of using weakly paired image-text data to learn to localize arbitrary text phrases in images.
We propose Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector.
We demonstrate a significant accuracy improvement of up to $8.5\%$ over the previous DF-WSG SotA.
- Score: 76.65699170882036
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Nowadays, there is an abundance of data involving images and surrounding
free-form text weakly corresponding to those images. Weakly Supervised
phrase-Grounding (WSG) deals with the task of using this data to learn to
localize (or to ground) arbitrary text phrases in images without any additional
annotations. However, most recent SotA methods for WSG assume the existence of
a pre-trained object detector, relying on it to produce the ROIs for
localization. In this work, we focus on the task of Detector-Free WSG (DF-WSG)
to solve WSG without relying on a pre-trained detector. We directly learn
everything from the images and associated free-form text pairs, thus
potentially gaining an advantage on the categories unsupported by the detector.
The key idea behind our proposed Grounding by Separation (GbS) method is
synthesizing 'text to image-regions' associations by random alpha-blending of
arbitrary image pairs and using the corresponding texts of the pair as
conditions to recover the alpha map from the blended image via a segmentation
network. At test time, this allows using the query phrase as a condition for a
non-blended query image, thus interpreting the test image as a composition of a
region corresponding to the phrase and the complement region. Using this
approach we demonstrate a significant accuracy improvement, of up to $8.5\%$
over previous DF-WSG SotA, for a range of benchmarks including Flickr30K,
Visual Genome, and ReferIt, as well as a significant complementary improvement
(above $7\%$) over the detector-based approaches for WSG.
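To make the blend-and-separate recipe concrete, the following is a minimal PyTorch sketch of one GbS training step and the test-time query, under stated assumptions: `seg_net` stands in for the paper's text-conditioned segmentation network (assumed to output per-pixel probabilities), and the uniform-noise alpha map and binary cross-entropy loss are simplifications rather than the paper's exact design.

```python
# Minimal sketch of Grounding by Separation (GbS); `seg_net` is an assumed
# text-conditioned segmentation network outputting per-pixel probabilities.
import torch
import torch.nn.functional as F

def gbs_training_step(seg_net, img_a, img_b, text_a, text_b):
    """Blend two unrelated images, then learn to separate them given their texts."""
    b, _, h, w = img_a.shape
    alpha = torch.rand(b, 1, h, w, device=img_a.device)  # random alpha map (simplified)
    blended = alpha * img_a + (1.0 - alpha) * img_b      # alpha-blended input
    pred_a = seg_net(blended, text_a)                    # conditioned on img_a's text
    pred_b = seg_net(blended, text_b)                    # conditioned on img_b's text
    # Recovering the alpha map (and its complement) is the only training signal.
    return (F.binary_cross_entropy(pred_a, alpha)
            + F.binary_cross_entropy(pred_b, 1.0 - alpha))

@torch.no_grad()
def gbs_ground(seg_net, img, phrase):
    """Test time: a non-blended image plus a query phrase yields a grounding map."""
    return seg_net(img, phrase)  # high values mark the region matching the phrase
```

At test time the network never sees a real blend; it interprets the query image as a composition of the phrase's region and its complement, which is exactly the grounding heat map.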
Related papers
- Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation [27.95875467352853]
We propose a new referring remote sensing image segmentation method, FIANet, that fully exploits the visual and linguistic representations.
The proposed fine-grained image-text alignment module (FIAM) simultaneously leverages the features of the input image and the corresponding texts.
We evaluate the effectiveness of the proposed methods on two public referring remote sensing datasets including RefSegRS and RRSIS-D.
arXiv Detail & Related papers (2024-09-20T16:45:32Z)
- Improving fine-grained understanding in image-text pre-training [37.163228122323865]
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs.
We show improved performance over competing approaches on both image-level tasks relying on coarse-grained information and region-level tasks relying on fine-grained information.
arXiv Detail & Related papers (2024-01-18T10:28:45Z)
- CLIM: Contrastive Language-Image Mosaic for Region Representation [58.05870131126816]
Contrastive Language-Image Mosaic (CLIM) is a novel approach for aligning region and text representations.
CLIM consistently improves different open-vocabulary object detection methods.
It can effectively enhance the region representation of vision-language models.
arXiv Detail & Related papers (2023-12-18T17:39:47Z)
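As a concrete reading of the mosaic idea in the entry above, here is a hedged PyTorch sketch: tile four images into one canvas, pool each tile of the encoder's dense feature map into a pseudo-region embedding, and contrast those embeddings with the matching caption embeddings. The `image_encoder` interface, the 2x2 layout, the mean pooling, and the temperature are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of a contrastive mosaic loss; `image_encoder` is assumed to
# return a dense feature map of shape (1, d, h', w') for the mosaicked image.
import torch
import torch.nn.functional as F

def mosaic_region_loss(images, text_embs, image_encoder, temperature=0.07):
    # images: (4, 3, H, W) distinct images; text_embs: (4, d) caption embeddings
    top = torch.cat([images[0], images[1]], dim=-1)
    bottom = torch.cat([images[2], images[3]], dim=-1)
    mosaic = torch.cat([top, bottom], dim=-2).unsqueeze(0)  # (1, 3, 2H, 2W)
    fmap = image_encoder(mosaic)                            # (1, d, h', w')
    _, d, fh, fw = fmap.shape
    regions = torch.stack([                                 # pool each quadrant
        fmap[0, :, :fh // 2, :fw // 2].mean(dim=(1, 2)),
        fmap[0, :, :fh // 2, fw // 2:].mean(dim=(1, 2)),
        fmap[0, :, fh // 2:, :fw // 2].mean(dim=(1, 2)),
        fmap[0, :, fh // 2:, fw // 2:].mean(dim=(1, 2)),
    ])                                                      # (4, d) pseudo regions
    regions = F.normalize(regions, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = regions @ text_embs.T / temperature            # region-text similarities
    return F.cross_entropy(logits, torch.arange(4))         # i-th region matches i-th text
```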
- Text Augmented Spatial-aware Zero-shot Referring Image Segmentation [60.84423786769453]
We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework.
TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial rectifier for mask post-processing.
The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
arXiv Detail & Related papers (2023-10-27T10:52:50Z)
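The proposal-and-score pipeline in the TAS entry above can be sketched as follows; `encode_image` and `encode_text` stand in for a CLIP-like encoder pair, masking by zeroing the background is a simplification, and the spatial rectifier is omitted. All names here are illustrative assumptions.

```python
# Hedged sketch of mask-proposal scoring against a referring phrase; the
# encoders are assumed CLIP-like callables returning embeddings of size (d,).
import torch
import torch.nn.functional as F

@torch.no_grad()
def pick_mask(image, masks, phrase, encode_image, encode_text):
    # image: (3, H, W); masks: (N, H, W) binary proposals from a mask network
    text_emb = F.normalize(encode_text(phrase), dim=-1)    # (d,)
    scores = []
    for m in masks:
        crop = image * m.unsqueeze(0)                      # zero out the background
        img_emb = F.normalize(encode_image(crop), dim=-1)  # (d,)
        scores.append(img_emb @ text_emb)                  # visual-text matching score
    return masks[torch.stack(scores).argmax()]             # best-scoring proposal
```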
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
- Using Text to Teach Image Retrieval [47.72498265721957]
We build on the concept of image manifold to represent the feature space of images, learned via neural networks, as a graph.
We augment the manifold samples with geometrically aligned text, thereby using a plethora of sentences to teach us about images.
The experimental results show that the joint embedding manifold is a robust representation, allowing it to be a better basis to perform image retrieval.
arXiv Detail & Related papers (2020-11-19T16:09:14Z)
- Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation [55.198596946371126]
We propose a contrastive learning framework that accounts for both region-phrase and image-sentence matching.
Our core innovation is the learning of a region-phrase score function, based on which an image-sentence score function is further constructed.
The design of such score functions removes the need of object detection at test time, thereby significantly reducing the inference cost.
arXiv Detail & Related papers (2020-07-03T22:02:00Z)
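The score-function construction in the last entry can be sketched as follows: region-phrase similarities are aggregated into a single image-sentence score. The max-over-regions, mean-over-phrases aggregation is an assumed instantiation, not necessarily the paper's exact choice.

```python
# Hedged sketch: build an image-sentence score from region-phrase scores.
import torch
import torch.nn.functional as F

def image_sentence_score(region_feats, phrase_feats):
    # region_feats: (R, d) region embeddings; phrase_feats: (P, d) phrase embeddings
    r = F.normalize(region_feats, dim=-1)
    p = F.normalize(phrase_feats, dim=-1)
    scores = p @ r.T                    # (P, R) region-phrase score matrix
    best = scores.max(dim=1).values     # best-matching region for each phrase
    return best.mean()                  # aggregate phrases into one sentence score
```

Because the sentence-level score is layered on top of the region-phrase scores, this design is what lets the method drop object detection at test time, as the entry notes.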