Grounding Everything: Emerging Localization Properties in
Vision-Language Transformers
- URL: http://arxiv.org/abs/2312.00878v3
- Date: Thu, 14 Dec 2023 11:24:46 GMT
- Title: Grounding Everything: Emerging Localization Properties in
Vision-Language Transformers
- Authors: Walid Bousselham, Felix Petersen, Vittorio Ferrari, Hilde Kuehne
- Abstract summary: We show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning.
We propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path.
We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation.
- Score: 51.260510447308306
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language foundation models have shown remarkable performance in
various zero-shot settings such as image retrieval, classification, or
captioning. But so far, those models seem to fall behind when it comes to
zero-shot localization of referential expressions and objects in images. As a
result, they need to be fine-tuned for this task. In this paper, we show that
pretrained vision-language (VL) models allow for zero-shot open-vocabulary
object localization without any fine-tuning. To leverage those capabilities, we
propose a Grounding Everything Module (GEM) that generalizes the idea of
value-value attention introduced by CLIPSurgery to a self-self attention path.
We show that the concept of self-self attention corresponds to clustering, thus
enforcing groups of tokens arising from the same object to be similar while
preserving the alignment with the language space. To further guide the group
formation, we propose a set of regularizations that allows the model to finally
generalize across datasets and backbones. We evaluate the proposed GEM
framework on various benchmark tasks and datasets for semantic segmentation. It
shows that GEM not only outperforms other training-free open-vocabulary
localization methods, but also achieves state-of-the-art results on the
recently proposed OpenImagesV7 large-scale segmentation benchmark.
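To make the mechanism concrete, the following is a minimal sketch (not the authors' released implementation) of a self-self attention path computed from the q/k/v projection weights of a single CLIP ViT attention layer; the weight names, the temperature tau, and the equal averaging of the three paths are illustrative assumptions.

# Minimal sketch of self-self attention for training-free localization.
# Assumes x holds the patch-token features of one CLIP ViT attention layer
# (shape: num_tokens x dim) and w_q, w_k, w_v are that layer's projection
# weights in torch.nn.Linear convention (out_features x in_features).
import torch
import torch.nn.functional as F

def self_self_attention(x, w_q, w_k, w_v, tau=0.1):
    """Value-value attention (CLIPSurgery) generalized to q-q, k-k and v-v paths.

    Each projection attends to itself, which acts like a soft clustering of
    tokens: tokens from the same object reinforce each other, while the output
    stays in the value space, so it keeps its alignment with the CLIP text
    embeddings. The temperature tau is an illustrative choice, not a value
    taken from the paper.
    """
    v = x @ w_v.T                       # value tokens serve as the output basis
    out = torch.zeros_like(v)
    for w in (w_q, w_k, w_v):           # generalize v-v attention to all three paths
        p = F.normalize(x @ w.T, dim=-1)
        attn = F.softmax(p @ p.T / tau, dim=-1)   # self-self similarity -> soft clusters
        out = out + attn @ v
    return out / 3.0

# The resulting patch tokens can then be scored against CLIP text embeddings of
# class prompts (cosine similarity) to obtain open-vocabulary localization maps.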
Related papers
- Locality Alignment Improves Vision-Language Models [55.275235524659905]
Vision language models (VLMs) have seen growing adoption in recent years, but many still make basic spatial reasoning errors.
We propose a new efficient post-training stage for ViTs called locality alignment.
We show that locality-aligned backbones improve performance across a range of benchmarks.
arXiv Detail & Related papers (2024-10-14T21:01:01Z)
- Subobject-level Image Tokenization [60.80949852899857]
Transformer-based vision models typically tokenize images into fixed-size square patches as input units.
Inspired by the subword tokenization widely adopted in language models, we propose an image tokenizer at a subobject level.
arXiv Detail & Related papers (2024-02-22T06:47:44Z)
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we address a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotations but only a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages existing pretrained vision-language (VL) models to train semantic segmentation models without human labels.
ZeroSeg does so by distilling the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features that are visible to the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Natural Scene Image Annotation Using Local Semantic Concepts and Spatial Bag of Visual Words [0.0]
This paper introduces a framework for automatically annotating natural scene images with local semantic labels from a predefined vocabulary.
The framework is based on the hypothesis that, in natural scenes, intermediate semantic concepts are correlated with local keypoints.
Under this hypothesis, image regions can be efficiently represented with a bag-of-visual-words (BOW) model, and a machine learning approach such as an SVM can then be used to label image regions with semantic annotations (a rough sketch of this pipeline follows below).
arXiv Detail & Related papers (2022-10-17T12:57:51Z)
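As a rough illustration of the pipeline described in the last entry (local keypoint descriptors, a bag-of-visual-words representation, and an SVM region classifier), here is a minimal sketch; the descriptor source, vocabulary size, and label set are assumptions, not the paper's actual setup.

# Rough sketch of a bag-of-visual-words + SVM region-labeling pipeline.
# The vocabulary size, the choice of descriptors (e.g. SIFT), and the label
# names are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def build_vocabulary(descriptors, n_words=200, seed=0):
    """Cluster local keypoint descriptors into a visual vocabulary."""
    return KMeans(n_clusters=n_words, random_state=seed).fit(descriptors)

def bow_histogram(region_descriptors, vocab):
    """Represent one image region as a normalized histogram of visual words."""
    words = vocab.predict(region_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_region_classifier(all_descriptors, region_descs, region_labels):
    """Fit the vocabulary on all training descriptors, then train an SVM that
    maps each region's BOW histogram to a local semantic label (e.g. "sky")."""
    vocab = build_vocabulary(all_descriptors)
    features = np.stack([bow_histogram(d, vocab) for d in region_descs])
    classifier = LinearSVC().fit(features, region_labels)
    return vocab, classifier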