DisCLIP: Open-Vocabulary Referring Expression Generation
- URL: http://arxiv.org/abs/2305.19108v1
- Date: Tue, 30 May 2023 15:13:17 GMT
- Title: DisCLIP: Open-Vocabulary Referring Expression Generation
- Authors: Lior Bracha, Eitan Shaar, Aviv Shamsian, Ethan Fetaya, Gal Chechik
- Abstract summary: We build on CLIP, a large-scale visual-semantic model, to guide an LLM to generate a contextual description of a target concept in an image.
We measure the quality of the generated text by evaluating the capability of a receiver model to accurately identify the described object within the scene.
Our results highlight the potential of using pre-trained visual-semantic models for generating high-quality contextual descriptions.
- Score: 37.789850573203694
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Referring Expressions Generation (REG) aims to produce textual descriptions
that unambiguously identify specific objects within a visual scene.
Traditionally, this has been achieved through supervised learning methods,
which perform well on specific data distributions but often struggle to
generalize to new images and concepts. To address this issue, we present a
novel approach for REG, named DisCLIP, short for discriminative CLIP. We build
on CLIP, a large-scale visual-semantic model, to guide an LLM to generate a
contextual description of a target concept in an image while avoiding other
distracting concepts. Notably, this optimization happens at inference time and
does not require additional training or tuning of learned parameters. We
measure the quality of the generated text by evaluating the capability of a
receiver model to accurately identify the described object within the scene. To
achieve this, we use a frozen zero-shot comprehension module as a critic of
our generated referring expressions. We evaluate DisCLIP on multiple referring
expression benchmarks through human evaluation and show that it significantly
outperforms previous methods on out-of-domain datasets. Our results highlight
the potential of using pre-trained visual-semantic models for generating
high-quality contextual descriptions.
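To make the inference-time procedure concrete, the sketch below shows one way to implement the kind of discriminative CLIP scoring the abstract describes: candidate expressions proposed by a language model are ranked by how well they match a crop of the target object relative to crops of the distracting objects. This is a minimal illustration rather than the authors' implementation; the function name, the crop-based inputs, the max-over-distractors margin, and the weight alpha are assumptions, and the snippet relies on the OpenAI clip package.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def discriminative_score(candidates, target_crop, distractor_crops, alpha=1.0):
    """Rank candidate referring expressions by how much better they match the
    target crop than the closest distractor crop under CLIP.

    candidates: list of strings proposed by the language model.
    target_crop, distractor_crops: PIL image crops of the target and distractor boxes.
    Returns a 1-D tensor with one score per candidate (higher = more discriminative).
    """
    crops = [target_crop] + list(distractor_crops)
    with torch.no_grad():
        image_input = torch.stack([preprocess(c) for c in crops]).to(device)
        text_input = clip.tokenize(candidates).to(device)
        img = model.encode_image(image_input).float()
        txt = model.encode_text(text_input).float()
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        sims = txt @ img.T  # (num_candidates, 1 + num_distractors)
        target_sim = sims[:, 0]
        if sims.shape[1] > 1:
            distractor_sim = sims[:, 1:].max(dim=-1).values
        else:
            distractor_sim = torch.zeros_like(target_sim)
    # Prefer text that is close to the target but far from every distractor.
    return target_sim - alpha * distractor_sim
```

As the abstract describes, a score of this form can steer the language model's generation at inference time, and a separate frozen zero-shot comprehension model then acts as the critic that checks whether the finished expression picks out the intended object.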
Related papers
- Boosting Weakly-Supervised Referring Image Segmentation via Progressive Comprehension [40.21084218601082]
This paper focuses on a challenging setup where target localization is learned directly from image-text pairs.
We propose a novel Progressive Comprehension Network (PCNet) to leverage target-related textual cues for progressively localizing the target object.
Our method outperforms SOTA methods on three common benchmarks.
arXiv Detail & Related papers (2024-10-02T13:30:32Z)
- Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of context information generated by generative large language models (LLMs).
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z)
- IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers [31.455819448471157]
Generative training has been demonstrated to be powerful for building visual-language models.
On zero-shot discriminative benchmarks, there is still a performance gap between models trained with generative and discriminative objectives.
In this paper, we aim to narrow this gap by improving the efficacy of generative training on classification tasks.
arXiv Detail & Related papers (2023-11-27T19:00:06Z)
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation (a generic sketch of EMA-teacher self-distillation appears after this list).
arXiv Detail & Related papers (2023-10-20T08:44:47Z)
- CLIP-Count: Towards Text-Guided Zero-Shot Object Counting [32.07271723717184]
We propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner.
To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction.
Our method effectively generates high-quality density maps for objects-of-interest.
arXiv Detail & Related papers (2023-05-12T08:19:39Z)
- Text2Model: Text-based Model Induction for Zero-shot Image Classification [38.704831945753284]
We address the challenge of building task-agnostic classifiers using only text descriptions.
We generate zero-shot classifiers using a hypernetwork that receives class descriptions and outputs a multi-class model.
We evaluate this approach in a series of zero-shot classification tasks, for image, point-cloud, and action recognition, using a range of text descriptions.
arXiv Detail & Related papers (2022-10-27T05:19:55Z)
- DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection [118.36746273425354]
This paper presents a paralleled visual-concept pre-training method for open-world detection that draws on knowledge enrichment from a designed concept dictionary.
By enriching the concepts with their descriptions, we explicitly build relationships among the various concepts to facilitate open-domain learning.
The proposed framework demonstrates strong zero-shot detection performance, e.g., on the LVIS dataset, our DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on rare categories.
arXiv Detail & Related papers (2022-09-20T02:01:01Z)
- No Token Left Behind: Explainability-Aided Image Classification and Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones (a minimal sketch of the pixel-text scoring step follows this list).
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
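The SILC entry above attributes its gains on dense tasks to distilling local image features from an exponential moving average (EMA) teacher. The sketch below illustrates generic EMA-teacher self-distillation with a local-to-global objective; the momentum value, the cosine-similarity loss, and the function names are assumptions for illustration and not SILC's exact objective.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Exponential-moving-average update of the teacher weights from the student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param.detach(), alpha=1.0 - momentum)

def local_to_global_loss(student, teacher, global_view, local_views):
    """Distill the teacher's embedding of the full image into the student's
    embeddings of local crops (illustrative cosine objective).

    student, teacher: modules mapping images (B, 3, H, W) to embeddings (B, D).
    global_view: full-image batch fed to the EMA teacher.
    local_views: list of cropped views fed to the student.
    """
    with torch.no_grad():
        target = F.normalize(teacher(global_view), dim=-1)
    loss = 0.0
    for crop in local_views:
        pred = F.normalize(student(crop), dim=-1)
        # Negative cosine similarity pulls each local embedding toward the global target.
        loss = loss - (pred * target).sum(dim=-1).mean()
    return loss / len(local_views)
```

After each optimizer step on the student, ema_update would be called to refresh the teacher, which receives no gradients.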
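The DenseCLIP entry above describes converting CLIP's image-text matching into pixel-text matching and using the resulting score maps to guide dense prediction. Below is a minimal sketch of that scoring step, assuming generic dense visual features and prompt embeddings; the tensor shapes, the temperature tau, and the function name are illustrative rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def pixel_text_score_maps(dense_feats, text_feats, tau=0.07):
    """Compute per-pixel class score maps from dense visual features and
    text embeddings.

    dense_feats: (B, C, H, W) visual features, e.g. pre-pooling CLIP features.
    text_feats:  (K, C) one embedding per class prompt.
    Returns score maps of shape (B, K, H, W).
    """
    dense = F.normalize(dense_feats, dim=1)   # normalize the channel dimension
    text = F.normalize(text_feats, dim=-1)
    # A K-way cosine score at every spatial location, scaled by a temperature.
    return torch.einsum("bchw,kc->bkhw", dense, text) / tau

# Tiny usage example with random tensors standing in for real features.
feats = torch.randn(2, 512, 14, 14)  # batch of dense features
texts = torch.randn(5, 512)          # 5 class prompt embeddings
print(pixel_text_score_maps(feats, texts).shape)  # torch.Size([2, 5, 14, 14])
```

Score maps of this kind can then serve as extra supervision or additional guidance for a segmentation or detection head, which matches the guidance role described in the entry above.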
This list is automatically generated from the titles and abstracts of the papers on this site.