PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System
- URL: http://arxiv.org/abs/2206.03001v1
- Date: Tue, 7 Jun 2022 04:33:50 GMT
- Title: PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System
- Authors: Chenxia Li, Weiwei Liu, Ruoyu Guo, Xiaoting Yin, Kaitao Jiang, Yongkun Du, Yuning Du, Lingfeng Zhu, Baohua Lai, Xiaoguang Hu, Dianhai Yu, Yanjun Ma
- Abstract summary: PP-OCRv3 upgrades the text detection model and text recognition model in 9 aspects based on PP-OCRv2.
Experiments on real data show that the hmean of PP-OCRv3 is 5% higher than PP-OCRv2 under comparable inference speed.
- Score: 11.622321298214043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optical character recognition (OCR) technology has been widely used in various scenes, as shown in Figure 1. Designing a practical OCR system is still a meaningful but challenging task. In previous work, considering both efficiency and accuracy, we proposed a practical ultra-lightweight OCR system (PP-OCR) and an optimized version, PP-OCRv2. To further improve the performance of PP-OCRv2, a more robust OCR system, PP-OCRv3, is proposed in this paper.
PP-OCRv3 upgrades the text detection model and text recognition model in 9 aspects based on PP-OCRv2. For the text detector, we introduce a PAN module with a large receptive field named LK-PAN, an FPN module with a residual attention mechanism named RSE-FPN, and a DML distillation strategy. For the text recognizer, the base model is changed from CRNN to SVTR, and we introduce the lightweight text recognition network SVTR_LCNet, guided training of CTC by attention, the data augmentation strategy TextConAug, a better pre-trained model via self-supervised TextRotNet, UDML, and UIM to accelerate the model and improve accuracy.
Experiments on real data show that the hmean of PP-OCRv3 is 5% higher than that of PP-OCRv2 under comparable inference speed. All of the above models are open-sourced, and the code is available in the GitHub repository PaddleOCR, which is powered by PaddlePaddle.
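The recognizer described above keeps a CTC head (CRNN before, SVTR_LCNet now), so its per-frame predictions are turned into text by CTC decoding. As a minimal illustrative sketch of that final step (the function and toy charset below are assumptions for illustration, not code from the PaddleOCR repository), greedy CTC decoding merges repeated labels and then drops blanks:

```python
def ctc_greedy_decode(frame_ids, charset, blank=0):
    """Collapse a per-frame argmax label sequence into text the CTC way:
    merge consecutive repeated labels, then drop blank symbols."""
    out = []
    prev = None
    for idx in frame_ids:
        # Emit a character only when the label changes and is not blank.
        if idx != prev and idx != blank:
            out.append(charset[idx])
        prev = idx
    return "".join(out)

# Toy charset: index 0 is the CTC blank, 1 -> 'a', 2 -> 'b'.
print(ctc_greedy_decode([1, 1, 0, 2, 0, 2, 2], ["_", "a", "b"]))  # prints "abb"
```

The blank symbol is what lets CTC represent genuinely repeated characters ("bb" needs a blank between the two runs of label 2), which is why blanks are dropped only after repeats are merged.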
Related papers
- SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition [77.28814034644287]
We propose SVTRv2, a CTC model that beats leading EDTRs in both accuracy and inference speed.
SVTRv2 introduces novel upgrades to handle text irregularity and utilize linguistic context.
We evaluate SVTRv2 in both standard and recent challenging benchmarks.
arXiv Detail & Related papers (2024-11-24T14:21:35Z)
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
- DLoRA-TrOCR: Mixed Text Mode Optical Character Recognition Based On Transformer [12.966765239586994]
Multiple fonts, mixed scenes, and complex layouts seriously affect the recognition accuracy of traditional OCR models.
We propose a parameter-efficient mixed text recognition method based on a pre-trained OCR Transformer, named DLoRA-TrOCR.
arXiv Detail & Related papers (2024-04-19T09:28:16Z)
- To show or not to show: Redacting sensitive text from videos of electronic displays [4.621328863799446]
We define an approach for redacting personally identifiable text from videos using a combination of optical character recognition (OCR) and natural language processing (NLP) techniques.
We examine the relative performance of this approach when used with different OCR models, specifically Tesseract and the OCR system from Google Cloud Vision (GCV).
arXiv Detail & Related papers (2022-08-19T07:53:04Z)
- SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding [131.0977050185209]
Selective Retraining (SiRi) significantly outperforms previous approaches on three popular benchmarks.
SiRi performs surprisingly well even with limited training data.
We also extend it to other transformer-based visual grounding models and vision-language tasks to verify its validity.
arXiv Detail & Related papers (2022-07-27T07:01:01Z)
- Donut: Document Understanding Transformer without OCR [17.397447819420695]
We propose a novel VDU model that is end-to-end trainable without relying on an underlying OCR framework.
Our approach achieves state-of-the-art performance on various document understanding tasks in public benchmark datasets and private industrial service datasets.
arXiv Detail & Related papers (2021-11-30T18:55:19Z)
- Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition [71.96870151495536]
We propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR).
The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model.
We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech.
arXiv Detail & Related papers (2021-10-08T05:07:35Z)
- PP-OCRv2: Bag of Tricks for Ultra Lightweight OCR System [9.376162696601238]
We introduce bag of tricks to train a better text detector and a better text recognizer.
Experiments on real data show that the precision of PP-OCRv2 is 7% higher than PP-OCR under the same inference cost.
arXiv Detail & Related papers (2021-09-07T15:24:40Z)
- Unknown-box Approximation to Improve Optical Character Recognition Performance [7.805544279853116]
A novel approach is presented for creating a customized preprocessor for a given OCR engine.
Experiments with two datasets and two OCR engines show that the presented preprocessor is able to improve the accuracy of the OCR by up to 46% over the baseline.
arXiv Detail & Related papers (2021-05-17T16:09:15Z)
- Structured Multimodal Attentions for TextVQA [57.71060302874151]
We propose an end-to-end structured multimodal attention (SMA) neural network to mainly solve the first two issues above.
SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it.
Our proposed model outperforms the SoTA models on TextVQA dataset and two tasks of ST-VQA dataset among all models except pre-training based TAP.
arXiv Detail & Related papers (2020-06-01T07:07:36Z)
- Optimization-driven Deep Reinforcement Learning for Robust Beamforming in IRS-assisted Wireless Communications [54.610318402371185]
Intelligent reflecting surface (IRS) is a promising technology to assist downlink information transmissions from a multi-antenna access point (AP) to a receiver.
We minimize the AP's transmit power by a joint optimization of the AP's active beamforming and the IRS's passive beamforming.
We propose a deep reinforcement learning (DRL) approach that can adapt the beamforming strategies from past experiences.
arXiv Detail & Related papers (2020-05-25T01:42:55Z)
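Several of the entries above revolve around CTC scoring; in particular, the CTC-based Word Spotter matches CTC log-probabilities against a compact context graph of candidate phrases. As an illustrative sketch of the underlying idea (not that paper's implementation — the `ctc_log_score` helper, label ids, and toy vocabulary are assumptions), a candidate word can be scored against a recognizer's per-frame log-probabilities with the standard CTC forward algorithm:

```python
import math

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_log_score(log_probs, target, blank=0):
    """Standard CTC forward algorithm: log-probability that the model emits
    `target`, summed over all valid frame alignments.
    log_probs: T x V list of per-frame log-probabilities; target: label ids."""
    ext = [blank]
    for c in target:              # interleave blanks: [_, t1, _, t2, _, ...]
        ext += [c, blank]
    S, T = len(ext), len(log_probs)
    alpha = [[-math.inf] * S for _ in range(T)]
    alpha[0][0] = log_probs[0][blank]
    if S > 1:
        alpha[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            terms = [alpha[t - 1][s]]
            if s > 0:
                terms.append(alpha[t - 1][s - 1])
            # The blank between two *different* labels may be skipped.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1][s - 2])
            alpha[t][s] = logsumexp(*terms) + log_probs[t][ext[s]]
    # A valid path ends in the last label or the trailing blank.
    return (logsumexp(alpha[T - 1][S - 1], alpha[T - 1][S - 2])
            if S > 1 else alpha[T - 1][S - 1])

# Toy setup: label 0 is the CTC blank, 1 -> 'a', 2 -> 'b'; frames favor "ab".
lp = [[math.log(p) for p in row] for row in [
    [0.05, 0.90, 0.05],   # frame 0: favors 'a'
    [0.05, 0.90, 0.05],   # frame 1: favors 'a'
    [0.05, 0.05, 0.90],   # frame 2: favors 'b'
    [0.05, 0.05, 0.90],   # frame 3: favors 'b'
]]
print(ctc_log_score(lp, [1, 2]) > ctc_log_score(lp, [2, 1]))  # prints True
```

A real word spotter shares prefixes between candidates in a trie-like context graph instead of scoring each word independently, but the per-word score it accumulates is this same forward quantity.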
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.