Related papers: Revisiting Table Detection Datasets for Visually Rich Documents

URL: http://arxiv.org/abs/2305.04833v2
Date: Wed, 8 Nov 2023 16:53:43 GMT
Title: Revisiting Table Detection Datasets for Visually Rich Documents
Authors: Bin Xiao, Murat Simsek, Burak Kantarci, Ala Abu Alkheir
Abstract summary: This study revisits some open datasets with high-quality annotations, identifies and cleans the noise, and aligns the annotation definitions of these datasets to merge a larger dataset, termed Open-Tables. To enrich the data sources, we propose a new ICT-TD dataset using the PDF files of Information and Communication Technologies (ICT) commodities, a different domain containing unique samples that hardly appear in open datasets. Our experimental results show that the domain differences among existing open datasets are minor despite having different data sources.
Score: 17.846536373106268
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Table Detection has become a fundamental task for visually rich document understanding with the surging number of electronic documents. However, popular public datasets widely used in related studies have inherent limitations, including noisy and inconsistent samples, limited training samples, and limited data sources. These limitations make the datasets unreliable for evaluating model performance, so results on them cannot reflect the actual capacity of models. Therefore, this study revisits some open datasets with high-quality annotations, identifies and cleans the noise, and aligns the annotation definitions of these datasets to merge a larger dataset, termed Open-Tables. Moreover, to enrich the data sources, we propose a new ICT-TD dataset using the PDF files of Information and Communication Technologies (ICT) commodities, a different domain containing unique samples that hardly appear in open datasets. To ensure the label quality of the dataset, we annotated the dataset manually following the guidance of a domain expert. The proposed dataset is challenging and can be a sample of actual cases in the business context. We built strong baselines using various state-of-the-art object detection models. Our experimental results show that the domain differences among existing open datasets are minor despite having different data sources. Our proposed Open-Tables and ICT-TD can provide a more reliable evaluation for models because of their high quality and consistent annotations. Besides, they are more suitable for cross-domain settings. Our experimental results show that in the cross-domain setting, benchmark models trained with the cleaned Open-Tables dataset can achieve 0.6%-2.6% higher weighted average F1 than the corresponding ones trained with the noisy version of Open-Tables, demonstrating the reliability of the proposed datasets. The datasets are publicly available.

Related papers

Detection of Personal Data in Structured Datasets Using a Large Language Model [0.0]
We propose a novel approach for detecting personal data in structured datasets, leveraging GPT-4o. We compare our approach to alternative methods, including Microsoft Presidio and CASSED, evaluating them on multiple datasets.
arXiv Detail & Related papers (2025-06-27T15:16:43Z)
Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers [0.0]
This paper presents a machine learning framework that automates dataset mention detection across research domains. We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset. At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall.
arXiv Detail & Related papers (2025-02-14T16:16:02Z)
A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance. Data selection has shown promise in identifying the most representative samples from the entire dataset. We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts [0.0]
This paper introduces the MERIT dataset, a fully labeled dataset within the context of school reports. By its nature, the MERIT dataset can potentially include biases in a controlled way, making it a valuable tool to benchmark biases induced in Language Models (LLMs). To demonstrate the dataset's utility, we present a benchmark with token classification models, showing that the dataset poses a significant challenge even for SOTA models.
arXiv Detail & Related papers (2024-08-31T12:56:38Z)
A Language Model-Guided Framework for Mining Time Series with Distributional Shifts [5.082311792764403]
This paper presents an approach that utilizes large language models and data source interfaces to explore and collect time series datasets. While obtained from external sources, the collected data share critical statistical properties with primary time series datasets. It suggests that collected datasets can effectively supplement existing datasets, especially involving changes in data distribution.
arXiv Detail & Related papers (2024-06-07T20:21:07Z)
RanLayNet: A Dataset for Document Layout Detection used for Domain Adaptation and Generalization [36.973388673687815]
RanLayNet is a synthetic document dataset enriched with automatically assigned labels. We show that a deep layout identification model trained on our dataset exhibits enhanced performance compared to a model trained solely on actual documents.
arXiv Detail & Related papers (2024-04-15T07:50:15Z)
UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria. We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets. We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets [71.54954966652286]
We try to evaluate the Vision-Language Instruction-Tuning (VLIT) datasets. We build a new dataset, REVO-LION, by collecting samples with higher SQ from each dataset. Remarkably, even with only half of the complete data, the model trained on REVO-LION can achieve the performance comparable to simply adding all VLIT datasets up.
arXiv Detail & Related papers (2023-10-10T13:01:38Z)
dacl1k: Real-World Bridge Damage Dataset Putting Open-Source Data to the Test [0.6827423171182154]
"dacl1k" is a multi-label RCD dataset for multi-label classification based on building inspections including 1,474 images. We trained the models on different combinations of open-source data (meta datasets) which were subsequently evaluated both extrinsically and intrinsically. The performance analysis on dacl1k shows practical usability of the meta data, where the best model shows an Exact Match Ratio of 32%.
arXiv Detail & Related papers (2023-09-07T15:05:35Z)
infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization. infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information. In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images. We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities. The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned. It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets. The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
arXiv Detail & Related papers (2022-06-07T17:59:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.