Turning a CLIP Model into a Scene Text Detector
- URL: http://arxiv.org/abs/2302.14338v3
- Date: Sun, 26 Mar 2023 12:58:19 GMT
- Title: Turning a CLIP Model into a Scene Text Detector
- Authors: Wenwen Yu, Yuliang Liu, Wei Hua, Deqiang Jiang, Bo Ren, Xiang Bai
- Abstract summary: Recently, pretraining approaches based on vision-language models have made effective progress in the field of text detection.
This paper proposes a new method, termed TCM, focusing on Turning the CLIP Model directly for text detection without a pretraining process.
- Score: 56.86413150091367
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The recent large-scale Contrastive Language-Image Pretraining (CLIP) model
has shown great potential in various downstream tasks via leveraging the
pretrained vision and language knowledge. Scene text, which contains rich
textual and visual information, has an inherent connection with a model like
CLIP. Recently, pretraining approaches based on vision-language models have
made effective progress in the field of text detection. In contrast to these
works, this paper proposes a new method, termed TCM, focusing on Turning the
CLIP Model directly for text detection without a pretraining process. We
demonstrate the advantages of the proposed TCM as follows: (1) The underlying
principle of our framework can be applied to improve existing scene text
detectors. (2) It facilitates the few-shot training capability of existing
methods, e.g., by using 10% of the labeled data, we significantly improve the
performance of the baseline method by an average of 22% in terms of the
F-measure on 4 benchmarks. (3) By incorporating the CLIP model into existing
scene text detection methods, we further achieve promising domain adaptation
ability.
The code will be publicly released at https://github.com/wenwenyu/TCM.
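To make the idea concrete, the sketch below illustrates CLIP-style pixel-text matching for text detection in PyTorch. Random tensors stand in for CLIP features and all names are illustrative; this is a minimal sketch of the general mechanism under those assumptions, not the authors' TCM implementation.

```python
# Minimal sketch of CLIP-guided text detection: compare per-pixel visual
# features against the embedding of a prompt such as "text" to obtain a
# "textness" score map that an existing detector head could consume.
# Random tensors stand in for CLIP features; this is NOT the authors' TCM code.
import torch
import torch.nn.functional as F

B, C, H, W = 2, 512, 32, 32            # batch, embedding dim, feature map size

# Stand-ins: dense visual features (e.g. from a CLIP image encoder adapted to
# keep spatial resolution) and the text embedding of a prompt like "text".
visual_feats = F.normalize(torch.randn(B, C, H, W), dim=1)
prompt_embed = F.normalize(torch.randn(C), dim=0)

# Per-pixel cosine similarity to the prompt -> guidance map in [0, 1].
score_map = torch.einsum("bchw,c->bhw", visual_feats, prompt_embed)
guidance = torch.sigmoid(score_map / 0.07)   # 0.07 ~ CLIP's softmax temperature

print(guidance.shape)  # torch.Size([2, 32, 32]); could be fed to a detection head
```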
Related papers
- SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z) - CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP.
We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 11 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z) - CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z) - Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z) - DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
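The pixel-text matching described for DenseCLIP can be sketched the same way, except with one score map per class rather than a single text prompt. Again, plain tensors stand in for CLIP outputs, and the names and shapes are illustrative assumptions rather than the paper's actual code.

```python
# Rough sketch of pixel-text score maps: per-pixel visual embeddings are
# matched against one text embedding per class, giving K maps that can guide
# a dense prediction model (e.g. as extra input channels or auxiliary targets).
import torch
import torch.nn.functional as F

B, C, H, W, K = 2, 512, 32, 32, 4      # K hypothetical class prompts

pixel_embeds = F.normalize(torch.randn(B, C, H, W), dim=1)  # dense image feats
class_embeds = F.normalize(torch.randn(K, C), dim=1)        # one text feat per class

# One H x W similarity map per class, per image.
score_maps = torch.einsum("bchw,kc->bkhw", pixel_embeds, class_embeds)
print(score_maps.shape)  # torch.Size([2, 4, 32, 32])
```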