RoDLA: Benchmarking the Robustness of Document Layout Analysis Models
- URL: http://arxiv.org/abs/2403.14442v1
- Date: Thu, 21 Mar 2024 14:47:12 GMT
- Title: RoDLA: Benchmarking the Robustness of Document Layout Analysis Models
- Authors: Yufan Chen, Jiaming Zhang, Kunyu Peng, Junwei Zheng, Ruiping Liu, Philip Torr, Rainer Stiefelhagen
- Abstract summary: We introduce a robustness benchmark for Document Layout Analysis (DLA) models, which includes 450K document images across three datasets.
To cover realistic corruptions, we propose a perturbation taxonomy with 36 common document perturbations inspired by real-world document processing.
To better understand document perturbation impacts, we propose two metrics, Mean Perturbation Effect (mPE) for perturbation assessment and Mean Robustness Degradation (mRD) for robustness evaluation.
- Score: 32.52120363558076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Before developing a Document Layout Analysis (DLA) model in real-world applications, conducting comprehensive robustness testing is essential. However, the robustness of DLA models remains underexplored in the literature. To address this, we are the first to introduce a robustness benchmark for DLA models, which includes 450K document images across three datasets. To cover realistic corruptions, we propose a perturbation taxonomy with 36 common document perturbations inspired by real-world document processing. Additionally, to better understand document perturbation impacts, we propose two metrics, Mean Perturbation Effect (mPE) for perturbation assessment and Mean Robustness Degradation (mRD) for robustness evaluation. Furthermore, we introduce a self-titled model, i.e., the Robust Document Layout Analyzer (RoDLA), which improves attention mechanisms to boost the extraction of robust features. Experiments on the proposed benchmarks (PubLayNet-P, DocLayNet-P, and M$^6$Doc-P) demonstrate that RoDLA obtains state-of-the-art mRD scores of 115.7, 135.4, and 150.4, respectively. Compared to previous methods, RoDLA achieves notable improvements in mAP of +3.8%, +7.1%, and +12.1%, respectively.
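As a rough illustration of how such robustness scores can be aggregated, the sketch below averages relative mAP degradation over severity levels for each perturbation, normalizes it by a per-perturbation difficulty score, and then averages over perturbation types. This is a simplified reading in the spirit of mRD, not the paper's exact formula; all function and variable names are assumptions.

```python
# Minimal sketch of an mRD-style aggregate (assumed form, not the paper's
# exact definition): relative mAP drop per (perturbation, severity) pair,
# normalized by an mPE-like difficulty score for that perturbation.
from collections import defaultdict

def mean_robustness_degradation(clean_map, perturbed_map, perturbation_effect):
    """clean_map: mAP on clean data.
    perturbed_map: dict mapping (perturbation, severity) -> mAP.
    perturbation_effect: dict mapping perturbation -> mPE-like difficulty score."""
    per_perturbation = defaultdict(list)
    for (pert, severity), score in perturbed_map.items():
        relative_drop = (clean_map - score) / clean_map
        per_perturbation[pert].append(relative_drop / perturbation_effect[pert])
    # Mean over severities first, then over perturbation types, scaled to percent.
    means = [sum(v) / len(v) for v in per_perturbation.values()]
    return 100.0 * sum(means) / len(means)
```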
Related papers
- LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting.
We introduce LLMEval-3, a framework for dynamic evaluation of LLMs.
LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
arXiv Detail & Related papers (2025-08-07T14:46:30Z) - MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm [60.14048367611333]
MonkeyOCR is a vision-language model for document parsing.
It advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm.
arXiv Detail & Related papers (2025-06-05T16:34:57Z) - DvD: Unleashing a Generative Paradigm for Document Dewarping via Coordinates-based Diffusion Model [25.504170988714783]
Document dewarping aims to rectify deformations in photographic document images, thus improving text readability.
We propose DvD, the first generative model to tackle document Dewarping via a Diffusion framework.
arXiv Detail & Related papers (2025-05-28T05:05:51Z) - Document Attribution: Examining Citation Relationships using Large Language Models [62.46146670035751]
We propose a zero-shot approach that frames attribution as a straightforward textual entailment task.
We also explore the role of the attention mechanism in enhancing the attribution process.
arXiv Detail & Related papers (2025-05-09T04:40:11Z) - YourBench: Easy Custom Evaluation Sets for Everyone [12.995134931278056]
YourBench is a novel, open-source framework for evaluating large language models (LLMs).
It generates reliable, up-to-date, and domain-tailored benchmarks cheaply and without manual annotation.
We release the YourBench library, the Tempora-0325 dataset, 150k+ question-answer pairs based on Tempora, and all evaluation and inference traces.
arXiv Detail & Related papers (2025-04-02T15:40:24Z) - A RAG-Based Institutional Assistant [0.1499944454332829]
We design and evaluate a RAG-based virtual assistant specifically tailored for the University of Sao Paulo.
Our optimal retriever model achieves a Top-5 accuracy of 30%, while our most effective generative model scores 22.04% against ground truth answers.
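For reference, a Top-5 retrieval accuracy like the one reported above is typically computed as the fraction of queries whose gold passage appears among the first five retrieved candidates; the helper below is a generic sketch, not code from the cited work.

```python
# Generic sketch of Top-k retrieval accuracy (not taken from the cited paper).
def top_k_accuracy(retrieved_ids_per_query, gold_id_per_query, k=5):
    """retrieved_ids_per_query: list of ranked candidate-ID lists, one per query.
    gold_id_per_query: list of gold passage IDs aligned with the queries."""
    hits = sum(
        gold in retrieved[:k]
        for retrieved, gold in zip(retrieved_ids_per_query, gold_id_per_query)
    )
    return hits / len(gold_id_per_query)
```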
arXiv Detail & Related papers (2025-01-23T17:54:19Z) - Efficient Document Ranking with Learnable Late Interactions [73.41976017860006]
Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval.
To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings.
Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs, by using a DE structure followed by a lightweight scorer.
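A minimal sketch of the general late-interaction idea (a ColBERT-style MaxSim step over per-token embeddings from a dual encoder) is shown below; the cited paper proposes learnable late interactions, so this fixed scorer only illustrates the overall layout.

```python
# Sketch of a fixed MaxSim late-interaction scorer; the paper's learnable
# scorer differs, this only illustrates the DE-then-lightweight-scorer design.
import torch

def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    """query_tokens: (num_query_tokens, dim), doc_tokens: (num_doc_tokens, dim);
    both are assumed L2-normalized so dot products are cosine similarities."""
    sim = query_tokens @ doc_tokens.T          # token-level similarity matrix
    return sim.max(dim=1).values.sum()         # best document token per query token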
arXiv Detail & Related papers (2024-06-25T22:50:48Z) - Robustness of Structured Data Extraction from In-plane Rotated Documents using Multi-Modal Large Language Models (LLM) [0.0]
This study investigates the impact of document skew on the data extraction accuracy of three state-of-the-art multi-modal models.
We identify the safe in-plane rotation angles (SIPRA) for each model and investigate the effects of skew on model hallucinations.
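A minimal way to generate such in-plane rotated test inputs is sketched below; the angle grid and white fill are illustrative assumptions, not the study's protocol.

```python
# Sketch: produce rotated copies of a document page to probe extraction
# accuracy under skew (angles and fill color are assumed, not from the study).
from PIL import Image

def rotated_variants(image_path: str, angles=(-10, -5, -2, 2, 5, 10)):
    """Yield (angle, rotated page) pairs for a scanned document image."""
    page = Image.open(image_path)
    for angle in angles:
        # expand=True keeps the whole page in frame after rotation.
        yield angle, page.rotate(angle, expand=True, fillcolor="white")
```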
arXiv Detail & Related papers (2024-06-13T08:55:01Z) - LLMs Can Patch Up Missing Relevance Judgments in Evaluation [56.51461892988846]
We use large language models (LLMs) to automatically label unjudged documents.
We simulate scenarios with varying degrees of holes by randomly dropping relevant documents from the relevance judgments in TREC DL tracks.
Our method achieves Kendall tau correlations of 0.87 and 0.92 on average for Vicuna-7B and GPT-3.5 Turbo, respectively.
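The Kendall tau figures quoted above compare system rankings produced with the original judgments against rankings produced after LLM-labeled documents fill the holes; a generic way to compute that correlation is sketched below (the aggregation details are assumptions, not the paper's pipeline).

```python
# Generic sketch of the rank-correlation check: compare per-system scores
# from full judgments vs. LLM-patched judgments.
from scipy.stats import kendalltau

def leaderboard_correlation(scores_full: dict, scores_patched: dict) -> float:
    """Both arguments map system name -> evaluation score (e.g. nDCG@10)."""
    systems = sorted(scores_full)
    tau, _ = kendalltau(
        [scores_full[s] for s in systems],
        [scores_patched[s] for s in systems],
    )
    return tau
```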
arXiv Detail & Related papers (2024-05-08T00:32:19Z) - Patch-Level Contrasting without Patch Correspondence for Accurate and Dense Contrastive Representation Learning [79.43940012723539]
ADCLR is a self-supervised learning framework for learning accurate and dense vision representations.
Our approach achieves new state-of-the-art performance for contrastive methods.
arXiv Detail & Related papers (2023-06-23T07:38:09Z) - Measuring the Robustness of NLP Models to Domain Shifts [50.89876374569385]
Existing research on Domain Robustness (DR) suffers from disparate setups, limited task variety, and scarce research on recent capabilities such as in-context learning.
Current research focuses on challenge sets and relies solely on the Source Drop (SD): using the source in-domain performance as a reference point for degradation.
We argue that the Target Drop (TD), which measures degradation from the target in-domain performance, should be used as a complementary point of view.
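In their simplest form, the two reference points can be written as relative drops from different baselines, as in the sketch below; exact sign and scaling conventions follow the cited paper.

```python
# Sketch of the two degradation measures discussed above (intuitive form only).
def source_drop(source_in_domain: float, cross_domain: float) -> float:
    """Degradation relative to the model's own in-domain (source) performance."""
    return (source_in_domain - cross_domain) / source_in_domain

def target_drop(target_in_domain: float, cross_domain: float) -> float:
    """Degradation relative to performance of a model trained on the target domain."""
    return (target_in_domain - cross_domain) / target_in_domain
```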
arXiv Detail & Related papers (2023-05-31T20:25:08Z) - Information Redundancy and Biases in Public Document Information Extraction Benchmarks [0.0]
Despite the good performance of Key Information Extraction (KIE) models when fine-tuned on public benchmarks, they still struggle to generalize to complex real-life use cases lacking sufficient document annotations.
Our research highlights that standard KIE benchmarks such as SROIE and FUNSD contain significant similarity between training and testing documents and can be adjusted to better evaluate model generalization.
arXiv Detail & Related papers (2023-04-28T15:48:26Z) - A Multi-Document Coverage Reward for RELAXed Multi-Document Summarization [11.02198476454955]
We propose fine-tuning an MDS baseline with a reward that balances a reference-based metric with coverage of the input documents.
Experimental results over the Multi-News and WCEP MDS datasets show significant improvements of up to +0.95 pp average ROUGE score and +3.17 pp METEOR score over the baseline.
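A minimal sketch of such a mixed reward is given below; the mixing weight and the coverage proxy are illustrative assumptions rather than the paper's formulation.

```python
# Sketch of a reward mixing a reference-based metric with input-document
# coverage (weight and coverage definition are assumed, not the paper's).
def mixed_reward(reference_metric: float, per_document_coverage: list, alpha: float = 0.5) -> float:
    """per_document_coverage: fraction of each input document reflected in the summary."""
    coverage = sum(per_document_coverage) / len(per_document_coverage)
    return alpha * reference_metric + (1 - alpha) * coverage
```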
arXiv Detail & Related papers (2022-03-06T07:33:01Z) - SAIS: Supervising and Augmenting Intermediate Steps for Document-Level Relation Extraction [51.27558374091491]
We propose to explicitly teach the model to capture relevant contexts and entity types by supervising and augmenting intermediate steps (SAIS) for relation extraction.
Based on a broad spectrum of carefully designed tasks, our proposed SAIS method not only extracts relations of better quality due to more effective supervision, but also retrieves the corresponding supporting evidence more accurately.
arXiv Detail & Related papers (2021-09-24T17:37:35Z) - SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval [11.38022203865326]
The SPLADE model provides highly sparse representations and competitive results with respect to state-of-the-art dense and sparse approaches.
We modify the pooling mechanism, benchmark a model solely based on document expansion, and introduce models trained with distillation.
Overall, SPLADE is considerably improved, with more than 9% gains on NDCG@10 on TREC DL 2019, leading to state-of-the-art results on the BEIR benchmark.
arXiv Detail & Related papers (2021-09-21T10:43:42Z)
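For context, the SPLADE family derives sparse term weights from masked-language-model logits; the sketch below shows the commonly described log-saturated activation with max pooling over input tokens, with tokenization and model loading omitted and assumed.

```python
# Sketch of a SPLADE-style sparse representation: log(1 + ReLU(logits)),
# max-pooled over input tokens (the v2 pooling change), giving one weight
# per vocabulary term. Model loading details are omitted and assumed.
import torch

def splade_weights(mlm_logits: torch.Tensor) -> torch.Tensor:
    """mlm_logits: (num_input_tokens, vocab_size) MLM logits for one text."""
    activations = torch.log1p(torch.relu(mlm_logits))   # sparsity-inducing saturation
    return activations.max(dim=0).values                # max pooling over tokens
```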
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.