Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition
- URL: http://arxiv.org/abs/2402.13643v1
- Date: Wed, 21 Feb 2024 09:22:45 GMT
- Title: Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition
- Authors: Mingkun Yang, Biao Yang, Minghui Liao, Yingying Zhu, Xiang Bai
- Abstract summary: We propose a novel approach called Class-Aware Mask-guided feature refinement (CAM).
Our approach introduces canonical class-aware glyph masks to suppress background and text style noise.
By enhancing the alignment between the canonical mask feature and the text feature, the module ensures more effective fusion.
- Score: 56.968108142307976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text recognition is a rapidly developing field that faces numerous
challenges due to the complexity and diversity of scene text, including complex
backgrounds, diverse fonts, flexible arrangements, and accidental occlusions.
In this paper, we propose a novel approach called Class-Aware Mask-guided
feature refinement (CAM) to address these challenges. Our approach introduces
canonical class-aware glyph masks generated from a standard font to effectively
suppress background and text style noise, thereby enhancing feature
discrimination. Additionally, we design a feature alignment and fusion module
to incorporate the canonical mask guidance for further feature refinement for
text recognition. By enhancing the alignment between the canonical mask feature
and the text feature, the module ensures more effective fusion, ultimately
leading to improved recognition performance. We first evaluate CAM on six
standard text recognition benchmarks to demonstrate its effectiveness.
Furthermore, CAM outperforms the state-of-the-art method by an average of 4.1%
across six more challenging datasets, despite using a smaller model. Our study
highlights the importance of
incorporating canonical mask guidance and aligned feature refinement techniques
for robust scene text recognition. The code is available at
https://github.com/MelosY/CAM.
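Going only by the abstract, the method has two ingredients: canonical glyph masks rendered from a standard font, and a module that aligns a predicted mask with the canonical one before fusing it back into the text feature. The sketch below is one plausible reading of that recipe in PyTorch/Pillow; the function and module names, the font choice, and the BCE-based alignment loss are illustrative assumptions, not the authors' actual CAM implementation (see the linked repository for that).

```python
# Hedged sketch only -- NOT the authors' CAM code. Assumes PyTorch, Pillow,
# and NumPy; the font path, mask head, and BCE alignment loss are assumptions.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image, ImageDraw, ImageFont


def render_glyph_mask(char: str, size: int = 32,
                      font_path: str = "DejaVuSans.ttf") -> torch.Tensor:
    """Render a canonical binary glyph mask for one character class
    from a standard font (assumes DejaVuSans is installed)."""
    font = ImageFont.truetype(font_path, size)
    img = Image.new("L", (size, size), 0)
    ImageDraw.Draw(img).text((0, 0), char, fill=255, font=font)
    return (torch.from_numpy(np.array(img)).float() / 255.0 > 0.5).float()


class MaskGuidedFusion(nn.Module):
    """Toy alignment-and-fusion module: predict a glyph mask from the
    visual feature map, align it to the canonical mask with a BCE loss,
    and fuse the mask-gated feature back into the original feature."""

    def __init__(self, channels: int):
        super().__init__()
        self.mask_head = nn.Conv2d(channels, 1, kernel_size=1)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feat: torch.Tensor, canonical_mask: torch.Tensor):
        pred_mask = torch.sigmoid(self.mask_head(feat))       # (B, 1, H, W)
        align_loss = F.binary_cross_entropy(pred_mask, canonical_mask)
        gated = feat * pred_mask                  # suppress background/style noise
        fused = self.fuse(torch.cat([feat, gated], dim=1))    # refined feature
        return fused, align_loss


if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)                    # dummy backbone features
    canon = render_glyph_mask("A").expand(2, 1, 32, 32)  # canonical mask, class 'A'
    fused, loss = MaskGuidedFusion(64)(feat, canon)
    print(fused.shape, loss.item())                      # torch.Size([2, 64, 32, 32])
```

One canonical mask per character class is what makes the target class-aware: every instance of, say, "A" is pulled toward the same clean glyph regardless of font, style, or background.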
Related papers
- TextMaster: Universal Controllable Text Edit [5.7173370525015095]
We propose TextMaster, a solution capable of accurately editing text with high realism and proper layout in any scenario and image area.
Our approach employs adaptive standard letter spacing as guidance during training and uses adaptive mask boosting to prevent the leakage of text position and size information.
By injecting high-resolution standard font information and applying perceptual loss in the text editing area, we further enhance text rendering accuracy and fidelity.
arXiv Detail & Related papers (2024-10-13T15:39:39Z) - Text-Guided Video Masked Autoencoder [12.321239366215426]
We introduce a novel text-guided masking algorithm (TGM) that masks the video regions with the highest correspondence to paired captions (a minimal sketch of this idea follows the related-papers list below).
We show that across existing masking algorithms, unifying MAE and masked video-text contrastive learning improves downstream performance compared to pure MAE.
arXiv Detail & Related papers (2024-08-01T17:58:19Z) - MTA-CLIP: Language-Guided Semantic Segmentation with Mask-Text Alignment [53.235290505274676]
Large-scale vision-language models such as CLIP can improve semantic segmentation performance.
We introduce MTA-CLIP, a novel framework employing mask-level vision-language alignment.
MTA-CLIP achieves state-of-the-art results, surpassing prior works by an average of 2.8% and 1.3% on benchmark datasets.
arXiv Detail & Related papers (2024-07-31T14:56:42Z) - Improving Face Recognition from Caption Supervision with Multi-Granular
Contextual Feature Aggregation [0.0]
We introduce caption-guided face recognition (CGFR) as a new framework to improve the performance of commercial-off-the-shelf (COTS) face recognition systems.
We implement the proposed CGFR framework on two face recognition models (ArcFace and AdaFace) and evaluate its performance on the Multi-Modal CelebA-HQ dataset.
arXiv Detail & Related papers (2023-08-13T23:52:15Z) - TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image
Super-Resolution [18.73348268987249]
TextDiff is a diffusion-based framework tailored for scene text image super-resolution.
It achieves state-of-the-art (SOTA) performance on public benchmark datasets.
Our proposed MRD module is plug-and-play and effectively sharpens the text edges produced by SOTA methods.
arXiv Detail & Related papers (2023-08-13T11:02:16Z) - Towards Robust Scene Text Image Super-resolution via Explicit Location
Enhancement [59.66539728681453]
Scene text image super-resolution (STISR) aims to improve image quality while boosting downstream scene text recognition accuracy.
Most existing methods treat the foreground (character regions) and background (non-character regions) equally in the forward process.
We propose a novel method LEMMA that explicitly models character regions to produce high-level text-specific guidance for super-resolution.
arXiv Detail & Related papers (2023-07-19T05:08:47Z) - Mask to reconstruct: Cooperative Semantics Completion for Video-text
Retrieval [19.61947785487129]
We propose Mask for Semantics Completion (MASCOT), which performs semantic-based masked modeling.
MASCOT achieves state-of-the-art performance on four major text-video retrieval benchmarks.
arXiv Detail & Related papers (2023-05-13T12:31:37Z) - MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
arXiv Detail & Related papers (2022-06-01T08:27:19Z) - Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy.
We propose a novel text-to-image method that addresses these gaps by enabling a simple control mechanism complementary to text in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z) - Open-Vocabulary Instance Segmentation via Robust Cross-Modal
Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images.
Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
arXiv Detail & Related papers (2021-11-24T18:50:47Z)