Revisiting Table Detection Datasets for Visually Rich Documents
- URL: http://arxiv.org/abs/2305.04833v2
- Date: Wed, 8 Nov 2023 16:53:43 GMT
- Title: Revisiting Table Detection Datasets for Visually Rich Documents
- Authors: Bin Xiao, Murat Simsek, Burak Kantarci, Ala Abu Alkheir
- Abstract summary: This study revisits some open datasets with high-quality annotations, identifies and cleans the noise, and aligns the annotation definitions of these datasets to merge them into a larger dataset, termed Open-Tables.
To enrich the data sources, we propose a new ICT-TD dataset using the PDF files of Information and Communication Technologies (ICT) commodities, a different domain containing unique samples that hardly appear in open datasets.
Our experimental results show that the domain differences among existing open datasets are minor despite having different data sources.
- Score: 17.846536373106268
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Table detection has become a fundamental task for visually rich document
understanding with the surging number of electronic documents. However, popular
public datasets widely used in related studies have inherent limitations,
including noisy and inconsistent samples, limited training samples, and limited
data sources. These limitations make the datasets unreliable for evaluating
model performance, so results on them may not reflect the actual capacity of
models. Therefore, this study revisits some open datasets with high-quality
annotations, identifies and cleans the noise, and aligns the annotation
definitions of these datasets to merge them into a larger dataset, termed
Open-Tables. Moreover, to enrich the data sources, we propose a new ICT-TD
dataset using the PDF files of Information and Communication Technologies (ICT)
commodities, a different domain containing unique samples that hardly appear in
open datasets. To ensure the label quality of the dataset, we annotated the
dataset manually following the guidance of a domain expert. The proposed
dataset is challenging and can serve as a sample of actual cases in the
business context. We built strong baselines using various state-of-the-art
object detection models. Our experimental results show that the domain
differences among existing open datasets are minor despite having different
data sources. Our proposed Open-Tables and ICT-TD can provide a more reliable
evaluation for models because of their high-quality and consistent annotations.
Besides, they are more suitable for cross-domain settings. Our experimental
results show that in the cross-domain setting, benchmark models trained with
the cleaned Open-Tables dataset can achieve 0.6%-2.6% higher weighted average
F1 than the corresponding ones trained with the noisy version of Open-Tables,
demonstrating the reliability of the proposed datasets. The datasets are
publicly available.
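The abstract reports results as a weighted average F1 but does not define it. The sketch below shows one common reading of that metric in table detection, assuming F1 is computed at IoU thresholds 0.6, 0.7, 0.8, and 0.9 and each score is weighted by its threshold. The thresholds, the greedy matching rule, and the function names `iou`, `f1_at`, and `weighted_avg_f1` are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_at(preds: List[Box], gts: List[Box], thr: float) -> float:
    """F1 at one IoU threshold, using greedy one-to-one matching (a simplification)."""
    matched = set()
    tp = 0
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou(p, g)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j >= 0 and best_iou >= thr:
            matched.add(best_j)
            tp += 1
    if tp == 0:
        return 0.0
    precision = tp / len(preds)
    recall = tp / len(gts)
    return 2 * precision * recall / (precision + recall)


def weighted_avg_f1(
    preds: List[Box],
    gts: List[Box],
    thresholds: Sequence[float] = (0.6, 0.7, 0.8, 0.9),  # assumed thresholds
) -> float:
    """Average of per-threshold F1 scores, each weighted by its IoU threshold."""
    scores = [f1_at(preds, gts, t) for t in thresholds]
    return sum(t * s for t, s in zip(thresholds, scores)) / sum(thresholds)


if __name__ == "__main__":
    ground_truth = [(0.0, 0.0, 10.0, 10.0)]
    prediction = [(0.0, 0.0, 10.0, 8.0)]  # IoU with the ground truth is 0.8
    print(weighted_avg_f1(prediction, ground_truth))
```

Under this weighting, a prediction that only passes the looser thresholds is penalized more than it would be under a plain mean, because the stricter thresholds carry larger weights.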
Related papers
- A Language Model-Guided Framework for Mining Time Series with Distributional Shifts [5.082311792764403]
This paper presents an approach that utilizes large language models and data source interfaces to explore and collect time series datasets.
While obtained from external sources, the collected data share critical statistical properties with primary time series datasets.
It suggests that collected datasets can effectively supplement existing datasets, especially involving changes in data distribution.
arXiv Detail & Related papers (2024-06-07T20:21:07Z)
- RanLayNet: A Dataset for Document Layout Detection used for Domain Adaptation and Generalization [36.973388673687815]
RanLayNet is a synthetic document dataset enriched with automatically assigned labels.
We show that a deep layout identification model trained on our dataset exhibits enhanced performance compared to a model trained solely on actual documents.
arXiv Detail & Related papers (2024-04-15T07:50:15Z)
- UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
- Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
- On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets [71.54954966652286]
We try to evaluate the Vision-Language Instruction-Tuning (VLIT) datasets.
We build a new dataset, REVO-LION, by collecting samples with higher SQ from each dataset.
Remarkably, even with only half of the complete data, the model trained on REVO-LION can achieve the performance comparable to simply adding all VLIT datasets up.
arXiv Detail & Related papers (2023-10-10T13:01:38Z)
- dacl1k: Real-World Bridge Damage Dataset Putting Open-Source Data to the Test [0.6827423171182154]
"dacl1k" is a multi-label RCD dataset for multi-label classification based on building inspections including 1,474 images.
We trained the models on different combinations of open-source data (meta datasets) which were subsequently evaluated both extrinsically and intrinsically.
The performance analysis on dacl1k shows practical usability of the meta data, where the best model shows an Exact Match Ratio of 32%.
arXiv Detail & Related papers (2023-09-07T15:05:35Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
- On the Composition and Limitations of Publicly Available COVID-19 X-Ray Imaging Datasets [0.0]
Data scarcity, mismatch between training and target population, group imbalance, and lack of documentation are important sources of bias.
This paper presents an overview of the currently publicly available COVID-19 chest X-ray datasets.
arXiv Detail & Related papers (2020-08-26T14:16:01Z)