On the Difference of BERT-style and CLIP-style Text Encoders
- URL: http://arxiv.org/abs/2306.03678v1
- Date: Tue, 6 Jun 2023 13:41:09 GMT
- Title: On the Difference of BERT-style and CLIP-style Text Encoders
- Authors: Zhihong Chen, Guiming Hardy Chen, Shizhe Diao, Xiang Wan, Benyou Wang
- Abstract summary: Masked language modeling (MLM) has been one of the most popular pretraining recipes in natural language processing.
Recently, contrastive language-image pretraining (CLIP) has also attracted attention, especially its vision models, which achieve excellent performance on a broad range of vision tasks.
- Score: 21.276382551459847
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked language modeling (MLM) has been one of the most popular pretraining recipes in natural language processing, with BERT as one of its representative models. Recently, contrastive language-image pretraining (CLIP) has also attracted attention, especially its vision models, which achieve excellent performance on a broad range of vision tasks. However, few studies have examined the text encoders learned by CLIP. In this paper, we analyze the difference between BERT-style and CLIP-style text encoders through three experiments: (i) general text understanding, (ii) vision-centric text understanding, and (iii) text-to-image generation. Experimental analyses show that although CLIP-style text encoders underperform BERT-style ones on general text understanding tasks, they are equipped with a unique ability, i.e., synesthesia, for cross-modal association, which is closer to human senses.
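As a rough illustration of the kind of comparison the paper describes, the sketch below embeds the same sentences with a BERT-style encoder and with the CLIP text encoder and compares them by cosine similarity. The HuggingFace checkpoints bert-base-uncased and openai/clip-vit-base-patch32 and the mean-pooling choice are assumptions for illustration, not the paper's exact setup.
```python
# Minimal sketch (not the paper's protocol): embed the same sentences with a
# BERT-style encoder and with the CLIP text encoder, then compare them by
# cosine similarity. Checkpoint names and pooling are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModel, CLIPTokenizer, CLIPTextModel

sentences = ["a photo of a red apple", "a photo of a green pear"]

# BERT-style encoder: mean-pool the last hidden states as a sentence embedding.
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    batch = bert_tok(sentences, padding=True, return_tensors="pt")
    hidden = bert(**batch).last_hidden_state             # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1)          # (B, T, 1)
    bert_emb = (hidden * mask).sum(1) / mask.sum(1)       # masked mean pooling

# CLIP-style text encoder: use the pooled output (EOS-token representation).
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    batch = clip_tok(sentences, padding=True, return_tensors="pt")
    clip_emb = clip_text(**batch).pooler_output           # (B, 512)

cos = torch.nn.functional.cosine_similarity
print("BERT similarity:", cos(bert_emb[0], bert_emb[1], dim=0).item())
print("CLIP similarity:", cos(clip_emb[0], clip_emb[1], dim=0).item())
```
Note that the pooling choice matters: BERT has no single canonical sentence vector, whereas CLIP's text tower is trained so that its EOS-token embedding aligns with image embeddings, which is one reason the two encoders can behave differently.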
Related papers
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
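A minimal sketch of the idea described above, under the assumption that the "pseudo visual embeddings" are the token-level hidden states of a frozen CLIP text encoder fed to a generic attention-based decoder; the toy charset, model sizes, and nn.TransformerDecoder stand-in are illustrative, not the authors' implementation.
```python
# Rough sketch of the DPTR idea: token-level hidden states from a frozen CLIP
# text encoder stand in for visual features ("pseudo visual embeddings"), and
# an attention-based decoder is trained to spell out the word from them.
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel

clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

charset = "abcdefghijklmnopqrstuvwxyz"
stoi = {c: i + 1 for i, c in enumerate(charset)}      # index 0 reserved for padding
VOCAB, DIM = len(charset) + 1, 512                    # 512 = CLIP text hidden size here

decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=DIM, nhead=8, batch_first=True), num_layers=2)
char_embed = nn.Embedding(VOCAB, DIM)
classifier = nn.Linear(DIM, VOCAB)

words = ["hello", "world"]                            # both length 5, so no padding
with torch.no_grad():                                 # text encoder stays frozen
    batch = clip_tok(words, padding=True, return_tensors="pt")
    pseudo_visual = clip_text(**batch).last_hidden_state       # (B, T, 512)

targets = torch.tensor([[stoi[c] for c in w] for w in words])  # character labels
logits = classifier(decoder(tgt=char_embed(targets), memory=pseudo_visual))
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
loss.backward()                                       # only decoder-side modules update
print(round(loss.item(), 3))
```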
- CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon the image and text encoders of CLIP.
We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 11 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z)
- CLIP also Understands Text: Prompting CLIP for Phrase Understanding [65.59857372525664]
Contrastive Language-Image Pretraining (CLIP) efficiently learns visual concepts by pre-training with natural language supervision.
We find that the text encoder of CLIP demonstrates a strong ability for phrase understanding and, with a properly designed prompt, can even significantly outperform popular language models such as BERT.
arXiv Detail & Related papers (2022-10-11T23:35:18Z)
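A hedged sketch of prompt-based phrase encoding with the CLIP text encoder, in the spirit of the entry above: each phrase is wrapped in a simple template before encoding, and phrases are then compared by cosine similarity. The template and the candidate phrases are assumptions, not the prompts used in the paper.
```python
# Prompted phrase encoding with CLIP's text tower (illustrative sketch).
import torch
from transformers import CLIPTokenizer, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def embed(phrases, template="a photo of a {}."):
    """Encode phrases with a prompt template using CLIP's text encoder."""
    texts = [template.format(p) for p in phrases]
    batch = tok(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_text_features(**batch)        # (N, 512) projected embeddings
    return torch.nn.functional.normalize(feats, dim=-1)

query = embed(["golden retriever"])
candidates = ["dog", "cat", "piano"]
scores = (embed(candidates) @ query.T).squeeze(-1)      # cosine similarity per candidate
for phrase, score in zip(candidates, scores.tolist()):
    print(f"{phrase}: {score:.3f}")
```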
- MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose MaskOCR, a novel approach that unifies vision and language pre-training in the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
arXiv Detail & Related papers (2022-06-01T08:27:19Z)
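A simplified sketch of masked image modeling on unlabeled text images, in the spirit of the entry above but not the authors' code: random patches are replaced by a learned mask token, a small transformer encodes the sequence, and the masked pixels are regressed with an L2 loss. The patch size, masking ratio, and tiny encoder are illustrative assumptions.
```python
# Masked image modeling on dummy "text line" images (illustrative sketch).
import torch
import torch.nn as nn

P, DIM, RATIO = 8, 256, 0.6                        # patch size, embed dim, mask ratio
img = torch.rand(2, 3, 32, 128)                    # dummy text-line images (B, C, H, W)

patchify = nn.Conv2d(3, DIM, kernel_size=P, stride=P)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=DIM, nhead=8, batch_first=True), num_layers=2)
to_pixels = nn.Linear(DIM, 3 * P * P)              # reconstruct each patch's pixels
mask_token = nn.Parameter(torch.zeros(1, 1, DIM))

tokens = patchify(img).flatten(2).transpose(1, 2)                 # (B, N, DIM)
target = nn.functional.unfold(img, P, stride=P).transpose(1, 2)   # (B, N, 3*P*P)

mask = torch.rand(tokens.shape[:2]) < RATIO        # True = patch is masked
tokens = torch.where(mask.unsqueeze(-1), mask_token, tokens)
recon = to_pixels(encoder(tokens))                 # (B, N, 3*P*P)

loss = ((recon - target) ** 2)[mask].mean()        # L2 loss on masked patches only
loss.backward()
print(loss.item())
```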
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS leverages vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
The proposed framework significantly outperforms previous state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
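A minimal sketch of text-to-pixel contrastive alignment in the spirit of the CRIS entry above, using random tensors in place of real features: every pixel feature is scored against the sentence embedding, and pixels inside the referred region are treated as positives. The shapes, temperature, and BCE-style loss are assumptions, not the authors' objective.
```python
# Text-to-pixel alignment sketch with dummy features (not the CRIS code).
import torch
import torch.nn as nn

B, D, H, W = 2, 512, 28, 28                    # batch, feature dim, feature-map size
pixel_feats = torch.randn(B, D, H, W)          # per-pixel visual features (dummy)
text_feats = torch.randn(B, D)                 # sentence embedding of the expression (dummy)
gt_mask = (torch.rand(B, H, W) > 0.5).float()  # ground-truth segmentation mask (dummy)

pixel_feats = nn.functional.normalize(pixel_feats, dim=1)
text_feats = nn.functional.normalize(text_feats, dim=1)

# Text-to-pixel similarity map: one logit per pixel per image.
logits = torch.einsum("bdhw,bd->bhw", pixel_feats, text_feats) / 0.07  # temperature is assumed

# Pixels inside the referred region are positives, the rest negatives.
loss = nn.functional.binary_cross_entropy_with_logits(logits, gt_mask)
pred_mask = (logits.sigmoid() > 0.5).float()
print(loss.item(), pred_mask.shape)
```
The fixed temperature of 0.07 is only a common default for contrastive objectives; in practice it is often a learned parameter.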