CLIP also Understands Text: Prompting CLIP for Phrase Understanding
- URL: http://arxiv.org/abs/2210.05836v1
- Date: Tue, 11 Oct 2022 23:35:18 GMT
- Title: CLIP also Understands Text: Prompting CLIP for Phrase Understanding
- Authors: An Yan, Jiacheng Li, Wanrong Zhu, Yujie Lu, William Yang Wang, Julian McAuley
- Abstract summary: Contrastive Language-Image Pretraining (CLIP) efficiently learns visual concepts by pre-training with natural language supervision.
In this paper, we find that the text encoder of CLIP actually demonstrates a strong ability for phrase understanding, and can even significantly outperform popular language models such as BERT with a properly designed prompt.
- Score: 65.59857372525664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pretraining (CLIP) efficiently learns visual
concepts by pre-training with natural language supervision. CLIP and its visual
encoder have been explored on various vision and language tasks and achieve
strong zero-shot or transfer learning performance. However, the application of
its text encoder solely for text understanding has been less explored. In this
paper, we find that the text encoder of CLIP actually demonstrates a strong
ability for phrase understanding, and can even significantly outperform popular
language models such as BERT with a properly designed prompt. Extensive
experiments validate the effectiveness of our method across different datasets
and domains on entity clustering and entity set expansion tasks.
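As a rough illustration of the idea, the sketch below embeds entity phrases with CLIP's text encoder using a simple prompt template and clusters the resulting embeddings. The prompt wording, checkpoint name, example entities, and cluster count are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: prompting CLIP's text encoder for phrase (entity) embeddings,
# then clustering them. Template, checkpoint, and cluster count are assumptions.
import torch
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPTokenizer

model_name = "openai/clip-vit-base-patch32"      # assumed checkpoint
tokenizer = CLIPTokenizer.from_pretrained(model_name)
model = CLIPModel.from_pretrained(model_name).eval()

entities = ["golden retriever", "siamese cat", "boeing 747", "airbus a380"]
template = "A photo of a {}."                    # assumed prompt template

prompts = [template.format(e) for e in entities]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")

with torch.no_grad():
    feats = model.get_text_features(**inputs)          # (N, 512) prompted text embeddings
feats = torch.nn.functional.normalize(feats, dim=-1)   # unit-norm for cosine similarity

# Entity clustering: group phrases by their prompted CLIP embeddings.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats.numpy())
print(dict(zip(entities, labels.tolist())))
```

The same embeddings could serve entity set expansion by ranking candidate phrases against the centroid of a small seed set under cosine similarity.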
Related papers
- Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP [46.53595526049201]
A text encoder within Vision-Language Models (VLMs) like CLIP plays a crucial role in translating textual input into an embedding space shared with images.
We propose a framework, Semantic Token Reweighting to build Interpretable text embeddings (SToRI).
SToRI refines the text encoding process in CLIP by differentially weighting semantic elements based on contextual importance.
arXiv Detail & Related papers (2024-10-11T02:42:13Z)
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- On the Difference of BERT-style and CLIP-style Text Encoders [21.276382551459847]
Masked language modeling (MLM) has been one of the most popular pretraining recipes in natural language processing.
Recent contrastive language-image pretraining (CLIP) has also attracted attention, especially its vision models that achieve excellent performance on a broad range of vision tasks.
arXiv Detail & Related papers (2023-06-06T13:41:09Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones; a minimal sketch of the pixel-text score-map idea appears after this list.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose CRIS, an end-to-end CLIP-Driven Referring Image Segmentation framework.
CRIS leverages vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
The proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
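To make the pixel-text matching idea in the DenseCLIP entry above concrete, here is the minimal sketch referenced there: cosine similarity between per-pixel visual features and class-name text embeddings produces one score map per class. The tensor shapes and random stand-in features are assumptions for illustration only.

```python
# Minimal sketch of pixel-text score maps for dense prediction:
# cosine similarity between per-pixel visual features and text embeddings.
# Shapes and the shared embedding dimension are illustrative assumptions.
import torch
import torch.nn.functional as F

B, C, H, W, K = 2, 512, 32, 32, 4         # batch, channels, height, width, classes

pixel_feats = torch.randn(B, C, H, W)     # stand-in for a visual backbone's feature map
text_feats = torch.randn(K, C)            # stand-in for prompted CLIP text embeddings

pixel_feats = F.normalize(pixel_feats, dim=1)   # normalize along the channel dimension
text_feats = F.normalize(text_feats, dim=-1)

# One (H, W) score map per class, usable as auxiliary supervision or as
# extra input channels for a dense prediction head.
score_maps = torch.einsum("bchw,kc->bkhw", pixel_feats, text_feats)
print(score_maps.shape)  # torch.Size([2, 4, 32, 32])
```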
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.