TSAL: Few-shot Text Segmentation Based on Attribute Learning
- URL: http://arxiv.org/abs/2504.11164v1
- Date: Tue, 15 Apr 2025 13:12:42 GMT
- Title: TSAL: Few-shot Text Segmentation Based on Attribute Learning
- Authors: Chenming Li, Chengxu Liu, Yuanting Fan, Xiao Jin, Xingsong Hou, Xueming Qian
- Abstract summary: We propose TSAL, which leverages CLIP's prior knowledge to learn text attributes for segmentation. To reduce data dependency and improve text detection accuracy, the adaptive prompt-guided branch employs effective adaptive prompt templates. Experiments demonstrate that our method achieves SOTA performance on multiple text segmentation datasets under few-shot settings.
- Score: 21.413607725856263
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Supervised learning for scene text segmentation has developed rapidly in recent years. However, the lack of high-quality datasets and the high cost of pixel annotation greatly limit its development. Considering how well few-shot learning methods perform on downstream tasks, we investigate the application of few-shot learning to scene text segmentation. We propose TSAL, which leverages CLIP's prior knowledge to learn text attributes for segmentation. To fully utilize the semantic and texture information in the image, a visual-guided branch is proposed to separately extract text and background features. To reduce data dependency and improve text detection accuracy, the adaptive prompt-guided branch employs effective adaptive prompt templates to capture various text attributes. To enable adaptive prompts to capture distinctive text features and complex background distributions, we propose the Adaptive Feature Alignment (AFA) module. By aligning learnable tokens of different attributes with visual features and prompt prototypes, AFA enables adaptive prompts to capture both general and distinctive attribute information. TSAL can capture the unique attributes of text and achieve precise segmentation using only a few images. Experiments demonstrate that our method achieves SOTA performance on multiple text segmentation datasets under few-shot settings and shows great potential in text-related domains.
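The abstract gives no implementation details, so the following is only a rough illustration of the general idea it builds on: comparing per-pixel visual features against a "text" prototype and a "background" prototype and labelling each pixel by the closer one. It is not TSAL's architecture. The AFA module and the learnable tokens are not modelled, and the function names, toy features, and prototype vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def segment(pixel_features, text_proto, background_proto):
    """Label each pixel 1 (text) or 0 (background) by its more similar prototype.

    pixel_features is an H x W grid of feature vectors; here they are
    hypothetical stand-ins for features from a visual encoder such as CLIP's.
    """
    return [
        [1 if cosine(f, text_proto) > cosine(f, background_proto) else 0
         for f in row]
        for row in pixel_features
    ]

# Toy 2x2 "image" with 2-dimensional features, invented for illustration.
features = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.1, 0.9], [0.8, 0.3]],
]
text_proto = [1.0, 0.0]        # hypothetical "text" attribute prototype
background_proto = [0.0, 1.0]  # hypothetical "background" prototype

mask = segment(features, text_proto, background_proto)
print(mask)  # [[1, 0], [0, 1]]
```

In a real system the prototypes would come from prompt embeddings and the features from an image encoder. The hard nearest-prototype rule is used here only to keep the sketch short.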
Related papers
- LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification [63.07563443280147]
We propose a novel framework named LATex for AG-ReID. It adopts prompt-tuning strategies to leverage attribute-based text knowledge. Our framework can fully leverage attribute-based text knowledge to improve the AG-ReID.
arXiv Detail & Related papers (2025-03-31T04:47:05Z)
- Enhancing Visual Representation for Text-based Person Searching [9.601697802095119]
VFE-TPS is a Visual Feature Enhanced Text-based Person Search model.
It introduces the pre-trained CLIP backbone to learn basic multimodal features.
It constructs a Text Guided Masked Image Modeling task to enhance the model's ability to learn local visual details.
arXiv Detail & Related papers (2024-12-30T01:38:14Z)
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing [47.421888361871254]
Scene text images contain not only style information (font, background) but also content information (character, texture).
Previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance.
We propose a Disentangled Representation Learning framework (DARLING) aimed at disentangling these two types of features for improved adaptability.
arXiv Detail & Related papers (2024-05-07T15:00:11Z)
- Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings.
We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features.
Our approach achieves favorable performance against existing methods in the literature.
arXiv Detail & Related papers (2024-04-01T17:48:15Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark [24.366997699462075]
We introduce a large Multi-Attribute and Language Search dataset for text-based person retrieval, called MALS.
Considering privacy concerns and annotation costs, we leverage off-the-shelf diffusion models to generate the dataset.
To verify the feasibility of learning from the generated data, we develop a new joint Attribute Prompt Learning and Text Matching Learning framework.
arXiv Detail & Related papers (2023-06-05T14:06:24Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks, respectively.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS uses vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.