Related papers: PP-StructureV2: A Stronger Document Analysis System

URL: http://arxiv.org/abs/2210.05391v2
Date: Thu, 13 Oct 2022 07:11:59 GMT
Title: PP-StructureV2: A Stronger Document Analysis System
Authors: Chenxia Li, Ruoyu Guo, Jun Zhou, Mengtao An, Yuning Du, Lingfeng Zhu, Yi Liu, Xiaoguang Hu, Dianhai Yu
Abstract summary: A large amount of document data exists in unstructured form such as raw images without any text information. We propose PP-StructureV2, which contains two subsystems: Layout Information Extraction and Key Information Extraction. All the above mentioned models and code are open-sourced in the GitHub repository PaddleOCR.
Score: 9.846187457305879
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A large amount of document data exists in unstructured form such as raw images without any text information. Designing a practical document image analysis system is a meaningful but challenging task. In previous work, we proposed an intelligent document analysis system PP-Structure. In order to further upgrade the function and performance of PP-Structure, we propose PP-StructureV2 in this work, which contains two subsystems: Layout Information Extraction and Key Information Extraction. Firstly, we integrate Image Direction Correction module and Layout Restoration module to enhance the functionality of the system. Secondly, 8 practical strategies are utilized in PP-StructureV2 for better performance. For Layout Analysis model, we introduce ultra light-weight detector PP-PicoDet and knowledge distillation algorithm FGD for model lightweighting, which increased the inference speed by 11 times with comparable mAP. For Table Recognition model, we utilize PP-LCNet, CSP-PAN and SLAHead to optimize the backbone module, feature fusion module and decoding module, respectively, which improved the table structure accuracy by 6% with comparable inference speed. For Key Information Extraction model, we introduce VI-LayoutXLM which is a visual-feature independent LayoutXLM architecture, TB-YX sorting algorithm and U-DML knowledge distillation algorithm, which brought 2.8% and 9.1% improvement respectively on the Hmean of Semantic Entity Recognition and Relation Extraction tasks. All the above mentioned models and code are open-sourced in the GitHub repository PaddleOCR.
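
Note on the metric: the abstract reports gains in Hmean without defining it. The sketch below assumes Hmean means the harmonic mean of precision and recall (the F1 score), as is common in OCR evaluation. The function name and the input values are hypothetical and are not taken from the paper or from PaddleOCR.

```python
def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed meaning of Hmean)."""
    # Guard against division by zero when both inputs are zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values for illustration only; not results from the paper.
print(hmean(0.90, 0.80))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model only scores well on Hmean when precision and recall are both high.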

Related papers

PARL: Position-Aware Relation Learning Network for Document Layout Analysis [23.497081928689525]
We argue that effective layout analysis depends not on text-visual fusion, but on a deep understanding of documents' intrinsic visual structure. We propose a novel OCR-free, vision-only framework that models layout through positional sensitivity and relational structure. Experiments show that PARL (65M) is highly efficient, using roughly four times fewer parameters than large multimodal models.
arXiv Detail & Related papers (2026-01-12T15:05:35Z)
MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns [80.05126590825121]
MonkeyOCR v1.5 is a unified vision-language framework that enhances both layout understanding and content recognition. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme. Two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables.
arXiv Detail & Related papers (2025-11-13T15:12:17Z)
Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution [88.20464308588889]
We propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR. This method is designed through unfolding an SR optimization function constrained by structural similarity. Our model outperforms current state-of-the-art models, boasting lower parameter counts and reduced memory consumption.
arXiv Detail & Related papers (2025-06-13T14:29:40Z)
Resource-Efficient Affordance Grounding with Complementary Depth and Semantic Prompts [21.435113588059924]
Affordance refers to the functional properties that an agent perceives and utilizes from its environment. Existing multimodal affordance methods face limitations in extracting useful information. This paper proposes the BiT-Align image-depth-text affordance mapping framework.
arXiv Detail & Related papers (2025-03-04T13:20:42Z)
Binarized Diffusion Model for Image Super-Resolution [61.963833405167875]
Binarization, an ultra-compression algorithm, offers the potential for effectively accelerating advanced diffusion models (DMs). Existing binarization methods result in significant performance degradation. We introduce a novel binarized diffusion model, BI-DiffSR, for image SR.
arXiv Detail & Related papers (2024-06-09T10:30:25Z)
Sample Complexity Characterization for Linear Contextual MDPs [67.79455646673762]
Contextual decision processes (CMDPs) describe a class of reinforcement learning problems in which the transition kernels and reward functions can change over time with different MDPs indexed by a context variable. CMDPs serve as an important framework to model many real-world applications with time-varying environments. We study CMDPs under two linear function approximation models: Model I with context-varying representations and common linear weights for all contexts; and Model II with common representations for all contexts and context-varying linear weights.
arXiv Detail & Related papers (2024-02-05T03:25:04Z)
CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly Detection [49.510604614688745]
We propose a framework called CLIP-AD to leverage the zero-shot capabilities of the large vision-language model CLIP. We note opposite predictions and irrelevant highlights in the direct computation of the anomaly maps.
arXiv Detail & Related papers (2023-11-01T11:39:22Z)
Binarized Spectral Compressive Imaging [59.18636040850608]
Existing deep learning models for hyperspectral image (HSI) reconstruction achieve good performance but require powerful hardware with enormous memory and computational resources. We propose a novel method, Binarized Spectral-Redistribution Network (BiSRNet). BiSRNet is derived by using the proposed techniques to binarize the base model.
arXiv Detail & Related papers (2023-05-17T15:36:08Z)
Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on matrix product operator (MPO). MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts. Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
Exploring Multimodal Sentiment Analysis via CBAM Attention and Double-layer BiLSTM Architecture [3.9850392954445875]
In our model, we use BERT + BiLSTM as a new feature extractor to capture the long-distance dependencies in sentences. To remove redundant information, CNN and CBAM attention are added after splicing text features and picture features. The experimental results show that our model achieves sound results, comparable to those of advanced models.
arXiv Detail & Related papers (2023-03-26T12:34:01Z)
Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation [46.23787695590861]
We propose a novel module to explicitly extract motion and appearance information via a unifying operation. Specifically, we rethink the information process in inter-frame attention and reuse its attention map for both appearance feature enhancement and motion information extraction. For both fixed- and arbitrary-timestep, our method achieves state-of-the-art performance on various datasets.
arXiv Detail & Related papers (2023-03-01T12:00:15Z)
PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object result. Recent transformer-based image recognition models show consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
VSR: A Unified Framework for Document Layout Analysis combining Vision, Semantics and Relations [40.721146438291335]
We propose a unified framework VSR for document layout analysis, combining vision, semantics and relations. On three popular benchmarks, VSR outperforms previous models by large margins.
arXiv Detail & Related papers (2021-05-13T12:20:30Z)