ViTEraser: Harnessing the Power of Vision Transformers for Scene Text
Removal with SegMIM Pretraining
- URL: http://arxiv.org/abs/2306.12106v2
- Date: Sun, 18 Feb 2024 14:58:09 GMT
- Title: ViTEraser: Harnessing the Power of Vision Transformers for Scene Text
Removal with SegMIM Pretraining
- Authors: Dezhi Peng, Chongyu Liu, Yuliang Liu, Lianwen Jin
- Abstract summary: Scene text removal (STR) aims at replacing text strokes in natural scenes with visually coherent backgrounds.
Recent STR approaches rely on iterative refinements or explicit text masks, resulting in high complexity and sensitivity to the accuracy of text localization.
We propose a simple-yet-effective ViT-based text eraser, dubbed ViTEraser.
- Score: 58.241008246380254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text removal (STR) aims at replacing text strokes in natural scenes
with visually coherent backgrounds. Recent STR approaches rely on iterative
refinements or explicit text masks, resulting in high complexity and
sensitivity to the accuracy of text localization. Moreover, most existing STR
methods adopt convolutional architectures while the potential of vision
Transformers (ViTs) remains largely unexplored. In this paper, we propose a
simple-yet-effective ViT-based text eraser, dubbed ViTEraser. Following a
concise encoder-decoder framework, ViTEraser can easily incorporate various
ViTs to enhance long-range modeling. Specifically, the encoder hierarchically
maps the input image into the hidden space through ViT blocks and patch
embedding layers, while the decoder gradually upsamples the hidden features to
the text-erased image with ViT blocks and patch splitting layers. As ViTEraser
implicitly integrates text localization and inpainting, we propose a novel
end-to-end pretraining method, termed SegMIM, which focuses the encoder and
decoder on the text box segmentation and masked image modeling tasks,
respectively. Experimental results demonstrate that ViTEraser with SegMIM
achieves state-of-the-art performance on STR by a substantial margin and
exhibits strong generalization ability when extended to other tasks,
e.g., tampered scene text detection. Furthermore, we comprehensively
explore the architecture, pretraining, and scalability of the ViT-based
encoder-decoder for STR, which provides deep insights into the application of
ViT to the STR field. Code is available at
https://github.com/shannanyinxiang/ViTEraser.
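For orientation, here is a minimal PyTorch-style sketch of the encoder-decoder pattern the abstract describes: hierarchical downsampling via patch-embedding layers with ViT blocks in the encoder, and symmetric upsampling via patch-splitting layers with ViT blocks in the decoder. This is not the authors' implementation (see the repository linked above); the module names, depths, and channel widths are illustrative assumptions, and the "ViT block" here is a plain pre-norm Transformer layer rather than the specific ViT variants explored in the paper.
```python
# Minimal sketch (not the authors' code; see the repository linked above for
# the official implementation). It illustrates the encoder-decoder pattern the
# abstract describes: patch-embedding downsampling plus ViT blocks in the
# encoder, and patch-splitting upsampling plus ViT blocks in the decoder.
# Module names, depths, and channel widths are illustrative assumptions.
import torch
import torch.nn as nn


class ViTBlock(nn.Module):
    """Pre-norm Transformer block applied to a (B, H*W, C) token sequence."""

    def __init__(self, dim, heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class PatchEmbed(nn.Module):
    """Encoder-side resampling: halve the resolution, widen the channels."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.proj = nn.Conv2d(c_in, c_out, kernel_size=2, stride=2)

    def forward(self, x):
        return self.proj(x)


class PatchSplit(nn.Module):
    """Decoder-side resampling: double the resolution, narrow the channels."""

    def __init__(self, c_in, c_out):
        super().__init__()
        self.proj = nn.ConvTranspose2d(c_in, c_out, kernel_size=2, stride=2)

    def forward(self, x):
        return self.proj(x)


class Stage(nn.Module):
    """Resample the feature map, then run ViT blocks on its flattened tokens."""

    def __init__(self, resample, dim, depth=2):
        super().__init__()
        self.resample = resample
        self.blocks = nn.ModuleList([ViTBlock(dim) for _ in range(depth)])

    def forward(self, x):
        x = self.resample(x)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        for blk in self.blocks:
            tokens = blk(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class TextEraser(nn.Module):
    """One-shot mapping from a scene image to a text-erased image."""

    def __init__(self, dims=(32, 64, 128)):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], 3, padding=1)
        self.encoder = nn.Sequential(
            Stage(PatchEmbed(dims[0], dims[1]), dims[1]),
            Stage(PatchEmbed(dims[1], dims[2]), dims[2]))
        self.decoder = nn.Sequential(
            Stage(PatchSplit(dims[2], dims[1]), dims[1]),
            Stage(PatchSplit(dims[1], dims[0]), dims[0]))
        self.head = nn.Conv2d(dims[0], 3, 3, padding=1)

    def forward(self, img):
        feats = self.encoder(self.stem(img))
        return torch.sigmoid(self.head(self.decoder(feats)))


if __name__ == "__main__":
    out = TextEraser()(torch.randn(1, 3, 64, 64))
    print(out.shape)  # torch.Size([1, 3, 64, 64])
```
Under SegMIM-style pretraining as the abstract describes it, one would additionally attach a text-box segmentation head to the encoder output and a masked-patch reconstruction target to the decoder output, training both end to end before fine-tuning on text removal; the heads and losses actually used by the authors are in the linked repository.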
Related papers
- Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling [44.70973195966149]
The existing scene text removal (STR) task suffers from insufficient training data due to expensive pixel-level labeling.
We introduce a Text-aware Masked Image Modeling algorithm (TMIM), which can pretrain STR models with low-cost text detection labels.
Our method outperforms other pretraining methods and achieves state-of-the-art performance (37.35 PSNR on SCUT-EnsText).
arXiv Detail & Related papers (2024-09-20T11:52:57Z) - Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z) - VCR: Visual Caption Restoration [80.24176572093512]
We introduce Visual Caption Restoration (VCR), a vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images.
This task stems from the observation that text embedded in images differs intrinsically from common visual elements and natural language, since restoring it requires aligning the modalities of vision, text, and text embedded in images.
arXiv Detail & Related papers (2024-06-10T16:58:48Z) - FETNet: Feature Erasing and Transferring Network for Scene Text Removal [14.763369952265796]
The scene text removal (STR) task aims to remove text regions from images and smoothly recover the background for private information protection.
Most existing STR methods adopt encoder-decoder CNNs that directly copy encoded features through skip connections.
We propose a novel Feature Erasing and Transferring (FET) mechanism to reconfigure the encoded features for STR.
arXiv Detail & Related papers (2023-06-16T02:38:30Z) - PSSTRNet: Progressive Segmentation-guided Scene Text Removal Network [1.7259824817932292]
Scene text removal (STR) is a challenging task due to the complex text fonts, colors, sizes, and background textures in scene images.
We propose a Progressive Segmentation-guided Scene Text Removal Network (PSSTRNet) to remove text from the image iteratively.
arXiv Detail & Related papers (2023-06-13T15:20:37Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP.
We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 11 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z) - Masked Vision-Language Transformers for Scene Text Recognition [10.057137581956363]
Scene text recognition (STR) enables computers to recognize and read the text in various real-world scenes.
Recent STR models benefit from considering linguistic information in addition to visual cues.
We propose Masked Vision-Language Transformers (MVLT) to capture both explicit and implicit linguistic information.
arXiv Detail & Related papers (2022-11-09T10:28:23Z) - TVLT: Textless Vision-Language Transformer [89.31422264408002]
We present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs.
TVLT attains performance comparable to its text-based counterpart on various multimodal tasks.
Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals.
arXiv Detail & Related papers (2022-09-28T15:08:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.