Turning a CLIP Model into a Scene Text Spotter
- URL: http://arxiv.org/abs/2308.10408v1
- Date: Mon, 21 Aug 2023 01:25:48 GMT
- Title: Turning a CLIP Model into a Scene Text Spotter
- Authors: Wenwen Yu, Yuliang Liu, Xingkui Zhu, Haoyu Cao, Xing Sun, Xiang Bai
- Abstract summary: We exploit the potential of the large-scale Contrastive Language-Image Pretraining (CLIP) model to enhance scene text detection and spotting tasks.
This backbone utilizes visual prompt learning and cross-attention in CLIP to extract image and text-based prior knowledge.
FastTCM-CR50 introduces an instance-language matching process to enhance the synergy between image and text embeddings.
- Score: 73.63953542526917
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We exploit the potential of the large-scale Contrastive Language-Image
Pretraining (CLIP) model to enhance scene text detection and spotting tasks,
transforming it into a robust backbone, FastTCM-CR50. This backbone utilizes
visual prompt learning and cross-attention in CLIP to extract image and
text-based prior knowledge. Using predefined and learnable prompts,
FastTCM-CR50 introduces an instance-language matching process to enhance the
synergy between image and text embeddings, thereby refining text regions. Our
Bimodal Similarity Matching (BSM) module facilitates dynamic language prompt
generation, enabling offline computations and improving performance.
FastTCM-CR50 offers several advantages: 1) It can enhance existing text
detectors and spotters, improving performance by an average of 1.7% and 1.5%,
respectively. 2) It outperforms the previous TCM-CR50 backbone, yielding an
average improvement of 0.2% and 0.56% in text detection and spotting tasks,
along with a 48.5% increase in inference speed. 3) It showcases robust few-shot
training capabilities. Utilizing only 10% of the supervised data, FastTCM-CR50
improves performance by an average of 26.5% and 5.5% for text detection and
spotting tasks, respectively. 4) It consistently enhances performance on
out-of-distribution text detection and spotting datasets, particularly the
NightTime-ArT subset from ICDAR2019-ArT and the DOTA dataset for oriented
object detection. The code is available at https://github.com/wenwenyu/TCM.
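To make the instance-language matching idea from the abstract concrete, here is a minimal PyTorch sketch of scoring candidate text regions against CLIP-style prompt embeddings by cosine similarity. The function name and the max-over-prompts aggregation are illustrative assumptions, not the paper's exact formulation; the released code at https://github.com/wenwenyu/TCM contains the actual implementation.

```python
import torch
import torch.nn.functional as F

def instance_language_match(instance_embeds: torch.Tensor,
                            prompt_embeds: torch.Tensor) -> torch.Tensor:
    """Score candidate text instances against language prompts.

    instance_embeds: (N, D) features of N candidate regions from an image encoder.
    prompt_embeds:   (P, D) features of P predefined or learnable prompts
                     (e.g. "text") from a text encoder.
    Returns an (N,) score per instance, usable for ranking or refining text regions.
    """
    # L2-normalize so the dot product becomes a cosine similarity, as in CLIP.
    img = F.normalize(instance_embeds, dim=-1)
    txt = F.normalize(prompt_embeds, dim=-1)
    sim = img @ txt.t()            # (N, P) instance-prompt similarities
    return sim.max(dim=-1).values  # best-matching prompt for each instance

# Toy usage with random tensors standing in for real CLIP encoder outputs.
scores = instance_language_match(torch.randn(5, 512), torch.randn(2, 512))
print(scores.shape)  # torch.Size([5])
```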
Related papers
- FastTextSpotter: A High-Efficiency Transformer for Multilingual Scene Text Spotting [14.054151352916296]
This paper presents FastTextSpotter, a framework that integrates a Swin Transformer visual backbone with a Transformer-Decoder architecture.
FastTextSpotter has been validated across multiple datasets, including ICDAR2015 for regular texts and CTW1500 and TotalText for arbitrary-shaped texts.
Our results indicate that FastTextSpotter achieves superior accuracy in detecting and recognizing multilingual scene text.
arXiv Detail & Related papers (2024-08-27T12:28:41Z)
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
- Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment [9.183014914635553]
We propose SelfAlign, an image-text alignment module built on top of the independent-embedding framework.
SelfAlign enforces image-text alignment at both the concept level and the context level through self-supervised contrastive learning (a minimal contrastive-alignment sketch appears after this list).
It consistently boosts the accuracy of state-of-the-art non-pretraining independent-embedding models by 9.1%, 4.2%, and 6.6% in R@sum score on the Flickr30K, MS-COCO 1K, and MS-COCO 5K datasets, respectively.
arXiv Detail & Related papers (2023-08-27T05:45:54Z)
- ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation [26.25673603166731]
Recent work shows that transferring the knowledge from CLIP to semantic segmentation via prompt learning can achieve promising performance.
We focus on improving the quality of vision-text alignment from two aspects: prompt design and the loss function.
We propose an align-guided contrastive loss to refine the alignment of vision and text embeddings.
arXiv Detail & Related papers (2023-08-14T11:21:47Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark [24.366997699462075]
We introduce a large Multi-Attribute and Language Search dataset for text-based person retrieval, called MALS.
Considering privacy concerns and annotation costs, we leverage off-the-shelf diffusion models to generate the dataset.
To verify the feasibility of learning from the generated data, we develop a new joint Attribute Prompt Learning and Text Matching Learning framework.
arXiv Detail & Related papers (2023-06-05T14:06:24Z)
- DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment [104.54362490182335]
DetCLIPv2 is an efficient training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection.
DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner.
With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance.
arXiv Detail & Related papers (2023-04-10T11:08:15Z)
- Turning a CLIP Model into a Scene Text Detector [56.86413150091367]
Recently, pretraining approaches based on vision-language models have made notable progress in the field of text detection.
This paper proposes a new method, termed TCM, focused on Turning the CLIP Model directly into a text detector without a dedicated pretraining process.
arXiv Detail & Related papers (2023-02-28T06:06:12Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
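Several of the related papers above (SelfAlign, ICPC, DetCLIPv2, and the weakly supervised pre-training work) build on contrastive image-text alignment. Below is a minimal PyTorch sketch of the symmetric InfoNCE objective such alignment typically relies on; the function name and temperature value are illustrative assumptions, and the snippet is not taken from any of the listed papers.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_embeds: torch.Tensor,
                                text_embeds: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    image_embeds, text_embeds: (B, D) embeddings of B paired samples.
    Matched pairs lie on the diagonal of the similarity matrix and are
    pulled together; all other pairs in the batch serve as negatives.
    """
    img = F.normalize(image_embeds, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)
    logits = img @ txt.t() / temperature             # (B, B) similarity logits
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random features in place of real encoder outputs.
loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(float(loss))
```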