Augmenters at SemEval-2023 Task 1: Enhancing CLIP in Handling
Compositionality and Ambiguity for Zero-Shot Visual WSD through Prompt
Augmentation and Text-To-Image Diffusion
- URL: http://arxiv.org/abs/2307.05564v1
- Date: Sun, 9 Jul 2023 22:39:37 GMT
- Title: Augmenters at SemEval-2023 Task 1: Enhancing CLIP in Handling
Compositionality and Ambiguity for Zero-Shot Visual WSD through Prompt
Augmentation and Text-To-Image Diffusion
- Authors: Jie S. Li, Yow-Ting Shiue, Yong-Siang Shih, and Jonas Geiping
- Abstract summary: This paper describes our zero-shot approaches for the Visual Word Sense Disambiguation Task in English.
Our preliminary study shows that the simple approach of matching candidate images with the phrase using CLIP suffers from the many-to-many nature of image-text pairs.
We find that the CLIP text encoder may have limited ability to capture compositionality in natural language.
- Score: 7.708214550816408
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper describes our zero-shot approaches for the Visual Word Sense
Disambiguation (VWSD) Task in English. Our preliminary study shows that the
simple approach of matching candidate images with the phrase using CLIP suffers
from the many-to-many nature of image-text pairs. We find that the CLIP text
encoder may have limited ability to capture compositionality in natural
language. Conversely, the descriptive focus of the phrase varies from instance
to instance. We address these issues in our two systems, Augment-CLIP and
Stable Diffusion Sampling (SD Sampling). Augment-CLIP augments the text prompt
by generating sentences that contain the context phrase with the help of large
language models (LLMs). We further explore CLIP models in other languages, as
an ambiguous word may be translated into an unambiguous one in the other
language. SD Sampling uses text-to-image Stable Diffusion to generate multiple
images from the given phrase, increasing the likelihood that a subset of images
match the one that is paired with the text.
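For illustration, here is a minimal sketch, not the authors' released code, of how the two systems could be wired together with off-the-shelf libraries: a CLIP ranker whose text prompt is augmented with LLM-generated sentences (Augment-CLIP), and an SD Sampling variant that generates images from the phrase and scores candidates by image-image similarity. The model checkpoints, the mean/max aggregation rules, and the function names are assumptions made for this sketch, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of Augment-CLIP-style and
# SD-Sampling-style ranking. Checkpoints, aggregation rules, and helper
# names are illustrative assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(texts, images):
    """Similarity logits between every text prompt and every candidate image."""
    inputs = proc(text=texts, images=images, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        out = clip(**inputs)
    return out.logits_per_text  # (num_texts, num_images)

def augment_clip_rank(phrase, candidate_images, llm_sentences):
    """Rank candidates by CLIP scores averaged over the bare phrase plus
    LLM-generated sentences that contain the context phrase."""
    prompts = [phrase] + list(llm_sentences)
    scores = clip_scores(prompts, candidate_images).mean(dim=0)
    return scores.argsort(descending=True)  # candidate indices, best first

def sd_sampling_rank(phrase, candidate_images, n_samples=4):
    """Generate images from the phrase with Stable Diffusion, then rank each
    candidate by its best CLIP image-image similarity to the samples."""
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    ).to(device)
    generated = pipe(phrase, num_images_per_prompt=n_samples).images

    def embed(imgs):
        pixel = proc(images=imgs, return_tensors="pt").to(device)
        with torch.no_grad():
            feats = clip.get_image_features(**pixel)
        return torch.nn.functional.normalize(feats, dim=-1)

    sims = embed(candidate_images) @ embed(generated).T  # (candidates, samples)
    return sims.max(dim=1).values.argsort(descending=True)
```

The actual systems may combine the base CLIP score with the augmented or translated-language scores differently; the sketch only illustrates the overall flow of ranking candidate images for a single context phrase.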
Related papers
- DiffCLIP: Few-shot Language-driven Multimodal Classifier [19.145645804307566]
DiffCLIP is a novel framework that extends Contrastive Language-Image Pretraining.
It conveys comprehensive language-driven semantic information for accurate classification of high-dimensional multimodal remote sensing images.
DiffCLIP achieves an overall accuracy improvement of 10.65% across three remote sensing datasets compared with CLIP.
arXiv Detail & Related papers (2024-12-10T02:21:39Z) - ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions [17.934227561793474]
Image retrieval from contextual descriptions (IRCD) aims to identify an image within a set of minimally contrastive candidates based on linguistically complex text.
We propose ContextBLIP, a method that relies on a doubly contextual alignment scheme for challenging IRCD.
We observe that ContextBLIP can yield comparable results with GPT-4V, despite involving about 7,500 times fewer parameters.
arXiv Detail & Related papers (2024-05-29T16:06:21Z) - Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z) - Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via
Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z) - LightCLIP: Learning Multi-Level Interaction for Lightweight
Vision-Language Models [45.672539931681065]
We propose a multi-level interaction paradigm for training lightweight CLIP models.
We also propose an auxiliary fusion module that injects unmasked image embeddings into the masked text embeddings.
arXiv Detail & Related papers (2023-12-01T15:54:55Z) - Text Augmented Spatial-aware Zero-shot Referring Image Segmentation [60.84423786769453]
We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework.
TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial rectifier for mask post-processing.
The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
arXiv Detail & Related papers (2023-10-27T10:52:50Z) - Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z) - S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist
Captions [69.01985134519244]
Vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated impressive results in natural image domains.
We propose S-CLIP, a semi-supervised learning method for training CLIP that utilizes additional unpaired images.
S-CLIP improves CLIP by 10% for zero-shot classification and 4% for image-text retrieval on the remote sensing benchmark.
arXiv Detail & Related papers (2023-05-23T14:18:11Z) - STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
It significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvements.
arXiv Detail & Related papers (2023-01-30T17:21:30Z) - Zero-Shot Video Captioning with Evolving Pseudo-Tokens [79.16706829968673]
We introduce a zero-shot video captioning method that employs two frozen networks: the GPT-2 language model and the CLIP image-text matching model.
The matching score is used to steer the language model toward generating a sentence that has a high average matching score to a subset of the video frames.
Our experiments show that the generated captions are coherent and display a broad range of real-world knowledge.
arXiv Detail & Related papers (2022-07-22T14:19:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.