PP-StructureV2: A Stronger Document Analysis System
- URL: http://arxiv.org/abs/2210.05391v2
- Date: Thu, 13 Oct 2022 07:11:59 GMT
- Title: PP-StructureV2: A Stronger Document Analysis System
- Authors: Chenxia Li, Ruoyu Guo, Jun Zhou, Mengtao An, Yuning Du, Lingfeng Zhu,
Yi Liu, Xiaoguang Hu, Dianhai Yu
- Abstract summary: A large amount of document data exists in unstructured form such as raw images without any text information.
We propose PP-StructureV2, which contains two subsystems: Layout Information Extraction and Key Information Extraction.
All the above mentioned models and code are open-sourced in the GitHub repository PaddleOCR.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A large amount of document data exists in unstructured form such as raw
images without any text information. Designing a practical document image
analysis system is a meaningful but challenging task. In previous work, we
proposed an intelligent document analysis system PP-Structure. In order to
further upgrade the function and performance of PP-Structure, we propose
PP-StructureV2 in this work, which contains two subsystems: Layout Information
Extraction and Key Information Extraction. Firstly, we integrate an Image
Direction Correction module and a Layout Restoration module to enhance the
functionality of the system. Secondly, 8 practical strategies are utilized in
PP-StructureV2 for better performance. For the Layout Analysis model, we introduce
the ultra-lightweight detector PP-PicoDet and the knowledge distillation algorithm FGD
for model lightweighting, which increased the inference speed by 11 times with
comparable mAP. For the Table Recognition model, we utilize PP-LCNet, CSP-PAN and
SLAHead to optimize the backbone module, feature fusion module and decoding
module, respectively, which improved the table structure accuracy by 6% with
comparable inference speed. For the Key Information Extraction model, we introduce
VI-LayoutXLM, a visual-feature-independent LayoutXLM architecture, the
TB-YX sorting algorithm and the U-DML knowledge distillation algorithm, which
brought 2.8% and 9.1% improvement respectively on the Hmean of Semantic
Entity Recognition and Relation Extraction tasks. All the above mentioned
models and code are open-sourced in the GitHub repository PaddleOCR.
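The abstract reports gains in Hmean without defining it. In OCR and information extraction evaluation, Hmean conventionally denotes the harmonic mean of precision and recall (the F1 score). The sketch below shows that computation under this assumption; the function names and the count-based helper are illustrative and are not taken from the paper or from PaddleOCR.

```python
def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        # Both scores are zero: define the harmonic mean as 0 to avoid dividing by zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


def hmean_from_counts(num_correct: int, num_predicted: int, num_gold: int) -> float:
    """Hypothetical helper: compute Hmean from raw match counts.

    num_correct   -- predicted items that match a ground-truth item
    num_predicted -- all items the model predicted
    num_gold      -- all items in the ground truth
    """
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    return hmean(precision, recall)


if __name__ == "__main__":
    # Example: 8 correct predictions out of 10 predicted, with 16 ground-truth items.
    print(round(hmean_from_counts(8, 10, 16), 4))
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot reach a high Hmean by trading recall for precision or the reverse.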
Related papers
- Binarized Diffusion Model for Image Super-Resolution [61.963833405167875]
We introduce a novel binarized diffusion model, BI-DiffSR, for image SR.
For the model structure, we design a UNet architecture optimized for binarization.
We propose the consistent-pixel-downsample (CP-Down) and consistent-pixel-upsample (CP-Up) modules to keep dimensions consistent.
Comprehensive experiments demonstrate that our BI-DiffSR outperforms existing binarization methods.
arXiv Detail & Related papers (2024-06-09T10:30:25Z) - Sample Complexity Characterization for Linear Contextual MDPs [67.79455646673762]
Contextual Markov decision processes (CMDPs) describe a class of reinforcement learning problems in which the transition kernels and reward functions can change over time with different MDPs indexed by a context variable.
CMDPs serve as an important framework to model many real-world applications with time-varying environments.
We study CMDPs under two linear function approximation models: Model I with context-varying representations and common linear weights for all contexts; and Model II with common representations for all contexts and context-varying linear weights.
arXiv Detail & Related papers (2024-02-05T03:25:04Z) - CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly
Detection [49.510604614688745]
We propose a framework called CLIP-AD to leverage the zero-shot capabilities of the large vision-language model CLIP.
We note opposite predictions and irrelevant highlights in the direct computation of the anomaly maps.
arXiv Detail & Related papers (2023-11-01T11:39:22Z) - Binarized Spectral Compressive Imaging [59.18636040850608]
Existing deep learning models for hyperspectral image (HSI) reconstruction achieve good performance but require powerful hardware with enormous memory and computational resources.
We propose a novel method, Binarized Spectral-Redistribution Network (BiSRNet).
BiSRNet is derived by using the proposed techniques to binarize the base model.
arXiv Detail & Related papers (2023-05-17T15:36:08Z) - Scaling Pre-trained Language Models to Deeper via Parameter-efficient
Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z) - Exploring Multimodal Sentiment Analysis via CBAM Attention and
Double-layer BiLSTM Architecture [3.9850392954445875]
In our model, we use BERT + BiLSTM as a new feature extractor to capture the long-distance dependencies in sentences.
To remove redundant information, CNN and CBAM attention are added after splicing text features and picture features.
The experimental results show that our model achieves sound results, comparable to those of advanced models.
arXiv Detail & Related papers (2023-03-26T12:34:01Z) - Extracting Motion and Appearance via Inter-Frame Attention for Efficient
Video Frame Interpolation [46.23787695590861]
We propose a novel module to explicitly extract motion and appearance information via a unifying operation.
Specifically, we rethink the information process in inter-frame attention and reuse its attention map for both appearance feature enhancement and motion information extraction.
For both fixed- and arbitrary-timestep interpolation, our method achieves state-of-the-art performance on various datasets.
arXiv Detail & Related papers (2023-03-01T12:00:15Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object detection result.
Extending the method to the recent transformer-based image recognition model ViT shows a consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z) - VSR: A Unified Framework for Document Layout Analysis combining Vision,
Semantics and Relations [40.721146438291335]
We propose a unified framework VSR for document layout analysis, combining vision, semantics and relations.
On three popular benchmarks, VSR outperforms previous models by large margins.
arXiv Detail & Related papers (2021-05-13T12:20:30Z)