Levenshtein OCR
- URL: http://arxiv.org/abs/2209.03594v1
- Date: Thu, 8 Sep 2022 06:46:50 GMT
- Title: Levenshtein OCR
- Authors: Cheng Da, Peng Wang, Cong Yao
- Abstract summary: A novel scene text recognizer based on Vision-Language Transformer (VLT) is presented.
Inspired by Levenshtein Transformer in the area of NLP, the proposed method explores an alternative way for automatically transcribing textual content from cropped natural images.
- Score: 20.48454415635795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A novel scene text recognizer based on Vision-Language Transformer (VLT) is
presented. Inspired by Levenshtein Transformer in the area of NLP, the proposed
method (named Levenshtein OCR, and LevOCR for short) explores an alternative
way for automatically transcribing textual content from cropped natural images.
Specifically, we cast the problem of scene text recognition as an iterative
sequence refinement process. The initial prediction sequence produced by a pure
vision model is encoded and fed into a cross-modal transformer to interact and
fuse with the visual features, to progressively approximate the ground truth.
The refinement process is accomplished via two basic character-level
operations: deletion and insertion, which are learned with imitation learning
and allow for parallel decoding, dynamic length change and good
interpretability. The quantitative experiments clearly demonstrate that LevOCR
achieves state-of-the-art performances on standard benchmarks and the
qualitative analyses verify the effectiveness and advantage of the proposed
LevOCR algorithm. Code will be released soon.
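To make the refinement loop concrete, below is a minimal PyTorch sketch of the deletion-and-insertion process, not the released LevOCR implementation. The policy heads, feature shapes, and convergence check are illustrative assumptions; the actual model re-encodes the sequence through the cross-modal transformer at every iteration, which this sketch only notes in a comment.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for LevOCR's imitation-learned policies over
# fused cross-modal features (names and shapes are assumptions).
class RefinementHeads(nn.Module):
    def __init__(self, dim=256, vocab_size=97):
        super().__init__()
        self.delete_head = nn.Linear(dim, 2)         # per character: keep / delete
        self.insert_head = nn.Linear(dim, 2)         # per position: insert after?
        self.char_head = nn.Linear(dim, vocab_size)  # which character to insert

@torch.no_grad()
def refine(seq, feats, heads, max_iters=3):
    """seq: list[int] from the vision model; feats: (len(seq), dim) fused features."""
    for _ in range(max_iters):
        # Deletion pass: every position is scored in parallel.
        keep = heads.delete_head(feats).argmax(-1) == 0
        seq = [c for c, k in zip(seq, keep.tolist()) if k]
        feats = feats[keep]
        if not seq:
            return seq
        # Insertion pass: optionally add one character after each position.
        ins = heads.insert_head(feats).argmax(-1).tolist()
        chars = heads.char_head(feats).argmax(-1).tolist()
        if keep.all() and not any(ins):
            return seq  # no edits proposed: converged
        new_seq, new_feats = [], []
        for i, c in enumerate(seq):
            new_seq.append(c)
            new_feats.append(feats[i])
            if ins[i]:
                new_seq.append(chars[i])
                new_feats.append(feats[i])  # the real model re-encodes each round
        seq, feats = new_seq, torch.stack(new_feats)
    return seq

heads = RefinementHeads()
print(refine([10, 11, 12], torch.randn(3, 256), heads))
```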
Related papers
- TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language Models [11.508589810076147]
TAP-VL treats Optical Character Recognition information as a distinct modality and seamlessly integrates it into any Vision-Language (VL) model.
Experiments demonstrate consistent performance improvements when applying TAP-VL to top-performing VL models.
arXiv Detail & Related papers (2024-11-07T11:54:01Z)
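A hedged sketch of the core TAP-VL idea, treating OCR as its own modality: OCR word embeddings and their layout boxes are projected into the VL model's token space and prepended to the visual tokens. All module names and dimensions below are assumptions, not TAP-VL's actual architecture.

```python
import torch
import torch.nn as nn

# Illustrative adapter that turns OCR words (content + box) into extra
# tokens for a VL transformer; all names and sizes are assumptions.
class OCRModalityAdapter(nn.Module):
    def __init__(self, ocr_dim=300, box_dim=4, model_dim=768):
        super().__init__()
        self.text_proj = nn.Linear(ocr_dim, model_dim)   # what the word says
        self.box_proj = nn.Linear(box_dim, model_dim)    # where it sits
        self.type_embed = nn.Parameter(torch.zeros(model_dim))  # marks "OCR"

    def forward(self, word_embs, boxes):
        return self.text_proj(word_embs) + self.box_proj(boxes) + self.type_embed

adapter = OCRModalityAdapter()
ocr_tokens = adapter(torch.randn(2, 8, 300), torch.rand(2, 8, 4))
visual_tokens = torch.randn(2, 50, 768)  # from the VL model's vision backbone
fused = torch.cat([ocr_tokens, visual_tokens], dim=1)  # input to the VL transformer
print(fused.shape)  # torch.Size([2, 58, 768])
```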
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
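A minimal sketch of the DPTR idea using the Hugging Face CLIP text encoder: token-level text embeddings stand in for visual features, so a recognition decoder can be pre-trained without any images. The tiny decoder and the character queries are illustrative stand-ins, not the paper's components.

```python
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

# A tiny stand-in decoder; the paper's decoder and queries differ.
layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)

batch = tokenizer(["hello", "street"], padding=True, return_tensors="pt")
with torch.no_grad():
    # Token-level hidden states play the role of visual embeddings.
    pseudo_visual = text_encoder(**batch).last_hidden_state  # (2, seq, 512)

queries = torch.zeros(2, 12, 512)  # character queries (learnable in practice)
out = decoder(tgt=queries, memory=pseudo_visual)  # cross-attend to pseudo visuals
print(out.shape)  # (2, 12, 512); a linear head would map this to characters
```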
- DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multiple fonts, mixed scenes, and complex layouts severely degrade the recognition accuracy of traditional OCR models.
We propose a parameter-efficient mixed-text recognition method based on a pre-trained OCR Transformer, named DLoRA-TrOCR.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
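A hedged sketch of LoRA-based parameter-efficient fine-tuning of TrOCR with the peft library, in the spirit of DLoRA-TrOCR. The target module names and LoRA hyperparameters are assumptions, not the paper's exact configuration.

```python
from transformers import VisionEncoderDecoderModel
from peft import LoraConfig, get_peft_model

model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

# Suffixes matching the attention projections in the ViT encoder
# ("query"/"value") and the TrOCR decoder ("q_proj"/"v_proj").
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "query", "value"],
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the small LoRA adapters are trained
```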
- Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment [33.96363443363547]
Sign language translation (SLT) aims to convert continuous sign language videos into textual sentences.
We propose a novel framework based on a Conditional Variational Autoencoder for SLT (CV-SLT).
CV-SLT consists of two paths with two Kullback-Leibler divergences to regularize the outputs of the encoder and decoder.
arXiv Detail & Related papers (2023-12-25T08:20:40Z)
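A minimal sketch of the two Gaussian KL terms such a CVAE might use, one regularizing the encoder path and one the decoder path against a shared prior. All tensors below are random placeholders; CV-SLT's actual paths and distributions differ.

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over the latent dim."""
    return 0.5 * (
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    ).sum(-1)

# Random placeholders: a prior from the video alone, a posterior that also
# sees the text, and a decoder-side distribution (shapes are assumptions).
mu_prior, lv_prior = torch.zeros(4, 64), torch.zeros(4, 64)
mu_post, lv_post = torch.randn(4, 64), torch.randn(4, 64)
mu_dec, lv_dec = torch.randn(4, 64), torch.randn(4, 64)

kl_encoder = gaussian_kl(mu_post, lv_post, mu_prior, lv_prior).mean()
kl_decoder = gaussian_kl(mu_dec, lv_dec, mu_prior, lv_prior).mean()
loss = kl_encoder + kl_decoder  # added to the usual translation loss
print(float(kl_encoder), float(kl_decoder))
```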
- Symmetrical Linguistic Feature Distillation with CLIP for Scene Text Recognition [77.93678598476149]
We establish a novel Symmetrical Linguistic Feature Distillation framework (named CLIP-OCR).
By cascading the CLIP image encoder with the reversed CLIP text encoder, a symmetrical structure is built with an image-to-text feature flow.
Extensive experiments demonstrate the effectiveness of CLIP-OCR with 93.8% average accuracy on six popular STR benchmarks.
arXiv Detail & Related papers (2023-10-08T04:00:20Z)
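A hedged sketch of stage-wise feature distillation along an image-to-text flow: student features are pulled toward frozen CLIP-side features with a cosine distance. The feature pairs are random placeholders for the cascaded encoder stages, and the loss form is an assumption rather than CLIP-OCR's exact objective.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_feats, teacher_feats):
    """Cosine-distance distillation between matched feature stages."""
    loss = 0.0
    for s, t in zip(student_feats, teacher_feats):
        s = F.normalize(s, dim=-1)
        t = F.normalize(t.detach(), dim=-1)  # the CLIP teacher stays frozen
        loss = loss + (1.0 - (s * t).sum(-1)).mean()
    return loss / len(student_feats)

# Random placeholders for three matched stages of student and teacher.
student = [torch.randn(8, 197, 768, requires_grad=True) for _ in range(3)]
teacher = [torch.randn(8, 197, 768) for _ in range(3)]
print(float(distill_loss(student, teacher)))
```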
- Self-supervised Character-to-Character Distillation for Text Recognition [54.12490492265583]
We propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate text representation learning.
CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution.
arXiv Detail & Related papers (2022-11-01T05:48:18Z)
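One plausible, hedged form of character-to-character distillation: per-character features from two augmented views of the same word image are aligned with a contrastive objective. The feature extraction, character segmentation, and loss form are all assumptions, not CCD's actual design.

```python
import torch
import torch.nn.functional as F

def char_distill_loss(student_chars, teacher_chars, tau=0.1):
    """Align character i of one view with character i of the other view."""
    s = F.normalize(student_chars, dim=-1)           # (num_chars, dim)
    t = F.normalize(teacher_chars.detach(), dim=-1)  # teacher side: no gradient
    logits = s @ t.T / tau
    targets = torch.arange(s.size(0))
    return F.cross_entropy(logits, targets)

view_a = torch.randn(6, 256, requires_grad=True)  # 6 characters, view A
view_b = torch.randn(6, 256)                      # same characters, view B
print(float(char_distill_loss(view_a, view_b)))
```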
- PreSTU: Pre-Training for Scene-Text Understanding [49.288302725486226]
We propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU).
PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text from an image and connect it to the rest of the image content.
We empirically demonstrate the effectiveness of this pre-training approach on eight visual question answering and four image captioning benchmarks.
arXiv Detail & Related papers (2022-09-12T18:29:55Z)
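A minimal sketch of an OCR-aware pre-training objective: the model is trained to generate the tokenized scene text from image features with a standard seq2seq loss. The encoder features, decoder, and target construction below are stand-ins, not PreSTU's actual recipe.

```python
import torch
import torch.nn as nn

vocab_size, dim = 1000, 256
layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)
embed = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size)

image_feats = torch.randn(2, 50, dim)               # from a vision backbone
ocr_target = torch.randint(0, vocab_size, (2, 12))  # tokenized scene text

# Teacher-forced generation of the scene text from the image features.
hidden = decoder(tgt=embed(ocr_target[:, :-1]), memory=image_feats)
logits = lm_head(hidden)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), ocr_target[:, 1:].reshape(-1)
)
loss.backward()  # one pre-training step: learn to read the text in the image
print(float(loss))
```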
- TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder in decoding visual features into text, which leads to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG).
arXiv Detail & Related papers (2021-11-16T09:10:39Z)
- Cycle-Consistent Inverse GAN for Text-to-Image Synthesis [101.97397967958722]
We propose a novel unified framework of Cycle-consistent Inverse GAN for both text-to-image generation and text-guided image manipulation tasks.
We learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image.
In the text-guided optimization module, we generate images with the desired semantic attributes by optimizing the inverted latent codes.
arXiv Detail & Related papers (2021-08-03T08:38:16Z)
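A minimal sketch of text-guided latent optimization after GAN inversion: starting from the inverted code z0, z is optimized so a text-image similarity score rises while a penalty keeps z near z0. The generator and the similarity scorer are placeholders for the paper's pretrained components.

```python
import torch

generator = lambda z: z  # placeholder for a pretrained GAN generator

def text_image_similarity(img, text_emb):
    # Placeholder scorer; the paper would use a learned text-image metric.
    return -(img.mean(-1) - text_emb.mean(-1)).pow(2).mean()

z0 = torch.randn(1, 512)        # latent code from the GAN inversion model
z = z0.clone().requires_grad_(True)
text_emb = torch.randn(1, 512)  # embedding of the desired attributes
opt = torch.optim.Adam([z], lr=0.05)

for _ in range(100):
    opt.zero_grad()
    img = generator(z)
    # Raise text-image similarity while staying near z0 to keep content.
    loss = -text_image_similarity(img, text_emb) + 0.1 * (z - z0).pow(2).mean()
    loss.backward()
    opt.step()
```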
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimized by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
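A hedged sketch of a modality-transition module: a small network maps pooled visual features into a semantic space and is supervised by a modality loss against the target caption's sentence embedding. The dimensions, the loss form, and both embeddings are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityTransition(nn.Module):
    """Maps pooled visual features into a semantic (sentence-level) space."""
    def __init__(self, vis_dim=2048, sem_dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vis_dim, sem_dim), nn.ReLU(), nn.Linear(sem_dim, sem_dim)
        )

    def forward(self, vis_feats):  # (batch, vis_dim)
        return self.net(vis_feats)

mtm = ModalityTransition()
vis = torch.randn(4, 2048)         # pooled CNN features (placeholder)
caption_emb = torch.randn(4, 768)  # target caption's sentence embedding
modality_loss = 1.0 - F.cosine_similarity(mtm(vis), caption_emb).mean()
modality_loss.backward()  # trained jointly with the captioning loss
print(float(modality_loss))
```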
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.