Masked Vision-Language Transformers for Scene Text Recognition
- URL: http://arxiv.org/abs/2211.04785v1
- Date: Wed, 9 Nov 2022 10:28:23 GMT
- Title: Masked Vision-Language Transformers for Scene Text Recognition
- Authors: Jie Wu, Ying Peng, Shengming Zhang, Weigang Qi, Jian Zhang
- Abstract summary: Scene text recognition (STR) enables computers to recognize and read the text in various real-world scenes.
Recent STR models benefit from taking linguistic information into consideration in addition to visual cues.
We propose a novel Masked Vision-Language Transformer (MVLT) to capture both explicit and implicit linguistic information.
- Score: 10.057137581956363
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Scene text recognition (STR) enables computers to recognize and read the text
in various real-world scenes. Recent STR models benefit from taking linguistic
information into consideration in addition to visual cues. We propose a novel
Masked Vision-Language Transformer (MVLT) to capture both explicit and
implicit linguistic information. Our encoder is a Vision Transformer, and our
decoder is a multi-modal Transformer. MVLT is trained in two stages: in the
first stage, we design an STR-tailored pretraining method based on a masking
strategy; in the second stage, we fine-tune our model and adopt an iterative
correction method to improve performance. MVLT attains superior results
compared to state-of-the-art STR models on several benchmarks. Our code and
model are available at https://github.com/onealwj/MVLT.
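As a rough illustration of the pipeline described in the abstract, the PyTorch sketch below pairs a ViT encoder with a multi-modal Transformer decoder, pretrains it with a masked-character objective, and decodes with an iterative-correction loop. The class names, dimensions, mask ratio, and the exact form of the masking and correction are illustrative assumptions, not the authors' method; the actual implementation is in the linked repository.

```python
import torch
import torch.nn as nn


class ViTEncoder(nn.Module):
    """Minimal ViT-style encoder: patch embedding + Transformer encoder."""

    def __init__(self, img_size=(32, 128), patch=8, dim=256, depth=6, heads=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size[0] // patch) * (img_size[1] // patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, images):  # images: (B, 3, H, W)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.encoder(x + self.pos)  # visual tokens


class MultiModalDecoder(nn.Module):
    """Transformer decoder attending character tokens to visual tokens."""

    def __init__(self, vocab_size=100, max_len=26, dim=256, depth=6, heads=8):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerDecoderLayer(dim, heads, dim * 4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, depth)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, char_ids, visual_tokens):  # (B, T), (B, N, dim)
        x = self.char_embed(char_ids) + self.pos[:, : char_ids.size(1)]
        return self.head(self.decoder(x, visual_tokens))  # (B, T, vocab)


def pretrain_step(encoder, decoder, images, char_ids, mask_id, mask_ratio=0.3):
    """Stage 1 (assumed form): mask a fraction of character tokens and train the
    decoder to reconstruct them from visual context and the unmasked text."""
    mask = torch.rand(char_ids.shape, device=char_ids.device) < mask_ratio
    corrupted = char_ids.clone()
    corrupted[mask] = mask_id
    logits = decoder(corrupted, encoder(images))
    if not mask.any():  # degenerate case: nothing was masked this step
        return logits.sum() * 0.0
    return nn.functional.cross_entropy(logits[mask], char_ids[mask])


@torch.no_grad()
def recognize_with_iterative_correction(encoder, decoder, images,
                                        max_len=26, mask_id=1, iters=3):
    """Stage 2 inference (assumed form): start from an all-[MASK] sequence and
    repeatedly re-feed the prediction so the decoder can correct its output."""
    visual = encoder(images)
    pred = torch.full((images.size(0), max_len), mask_id,
                      dtype=torch.long, device=images.device)
    for _ in range(iters):
        pred = decoder(pred, visual).argmax(dim=-1)  # refine previous guess
    return pred
```

A usage sketch would instantiate `ViTEncoder()` and `MultiModalDecoder()`, run `pretrain_step` for stage one, then fine-tune and call `recognize_with_iterative_correction` at inference; the number of correction iterations the paper actually uses is not stated in the abstract.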
Related papers
- ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining [58.241008246380254]
Scene text removal (STR) aims at replacing text strokes in natural scenes with visually coherent backgrounds.
Recent STR approaches rely on iterative refinements or explicit text masks, resulting in high complexity and sensitivity to the accuracy of text localization.
We propose a simple yet effective ViT-based text eraser, dubbed ViTEraser.
arXiv Detail & Related papers (2023-06-21T08:47:20Z)
- TVLT: Textless Vision-Language Transformer [89.31422264408002]
We present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs.
TVLT attains performance comparable to its text-based counterpart on various multimodal tasks.
Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals.
arXiv Detail & Related papers (2022-09-28T15:08:03Z)
- VL-BEiT: Generative Vision-Language Pretraining [107.25298505511184]
We introduce a vision-language foundation model called VL-BEiT, a bidirectional multimodal Transformer learned by generative pretraining.
Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images (a generic sketch of this style of masked objective appears after this list).
arXiv Detail & Related papers (2022-06-02T16:14:19Z)
- Training Vision-Language Transformers from Captions [80.00302205584335]
We introduce a new model, Vision-Language from Captions (VLC), built on top of Masked Auto-Encoders.
In a head-to-head comparison between ViLT and our model, we find that our approach outperforms ViLT on standard benchmarks.
arXiv Detail & Related papers (2022-05-19T00:19:48Z)
- VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts [46.55920956687346]
We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network.
Because of the modeling flexibility of MoME, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks.
We propose a stagewise pre-training strategy that effectively leverages large-scale image-only and text-only data in addition to image-text pairs.
arXiv Detail & Related papers (2021-11-03T17:20:36Z)
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective into a generative, denoising one, while training only a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers, on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
- Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training [139.4566371416662]
Vision-Language Pre-training aims to learn multi-modal representations from image-text pairs.
CNNs have limitations in visual relation learning because their local receptive fields are weak at modeling long-range dependencies.
arXiv Detail & Related papers (2021-06-25T08:04:25Z)
- E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning [31.622393984150314]
We propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation.
We build a unified Transformer framework to jointly learn visual representations and semantic alignments between image and text.
arXiv Detail & Related papers (2021-06-03T12:50:26Z)
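Several of the related papers above, VL-BEiT in particular, pretrain with masked objectives over one or both modalities. The sketch below is a generic, assumed form of such an objective rather than any specific paper's API: it corrupts random text tokens and visual (patch) tokens and computes a loss only at the masked positions. The `model` interface, the mask ids, and the 15% ratio are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F


def masked_multimodal_loss(model, text_ids, patch_ids, text_mask_id,
                           patch_mask_id, mask_ratio=0.15):
    """Generic masked vision-language objective (illustrative assumption):
    corrupt random positions in each modality with a [MASK] id and train the
    model to recover the original ids at exactly those positions."""

    def corrupt(ids, mask_id):
        mask = torch.rand(ids.shape, device=ids.device) < mask_ratio
        corrupted = ids.clone()
        corrupted[mask] = mask_id
        return corrupted, mask

    text_in, text_mask = corrupt(text_ids, text_mask_id)
    patch_in, patch_mask = corrupt(patch_ids, patch_mask_id)

    # `model` stands in for a bidirectional multimodal Transformer returning
    # per-position logits over the text and visual-token vocabularies.
    text_logits, patch_logits = model(text_in, patch_in)

    loss = text_logits.sum() * 0.0  # keeps the graph valid if nothing is masked
    if text_mask.any():  # masked language modeling term
        loss = loss + F.cross_entropy(text_logits[text_mask], text_ids[text_mask])
    if patch_mask.any():  # masked image modeling term over visual token ids
        loss = loss + F.cross_entropy(patch_logits[patch_mask], patch_ids[patch_mask])
    return loss
```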