Revisiting Table Detection Datasets for Visually Rich Documents
- URL: http://arxiv.org/abs/2305.04833v2
- Date: Wed, 8 Nov 2023 16:53:43 GMT
- Title: Revisiting Table Detection Datasets for Visually Rich Documents
- Authors: Bin Xiao, Murat Simsek, Burak Kantarci, Ala Abu Alkheir
- Abstract summary: This study revisits some open datasets with high-quality annotations, identifies and cleans the noise, and aligns the annotation definitions of these datasets to merge a larger dataset, termed Open-Tables.
To enrich the data sources, we propose a new ICT-TD dataset using the PDF files of Information and Communication Technologies (ICT) commodities, a different domain containing unique samples that hardly appear in open datasets.
Our experimental results show that the domain differences among existing open datasets are minor despite having different data sources.
- Score: 17.846536373106268
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Table Detection has become a fundamental task for visually rich document
understanding with the surging number of electronic documents. However, popular
public datasets widely used in related studies have inherent limitations,
including noisy and inconsistent samples, limited training samples, and limited
data sources. These limitations make the datasets unreliable for evaluating
model performance, so results on them cannot reflect the actual capacity of models. Therefore,
this study revisits some open datasets with high-quality annotations,
identifies and cleans the noise, and aligns the annotation definitions of these
datasets to merge a larger dataset, termed Open-Tables. Moreover, to enrich the
data sources, we propose a new ICT-TD dataset using the PDF files of
Information and Communication Technologies (ICT) commodities, a different
domain containing unique samples that hardly appear in open datasets. To ensure
the label quality of the dataset, we annotated the dataset manually following
the guidance of a domain expert. The proposed dataset is challenging and can be
a sample of actual cases in the business context. We built strong baselines
using various state-of-the-art object detection models. Our experimental
results show that the domain differences among existing open datasets are minor
despite having different data sources. Our proposed Open-Tables and ICT-TD can
provide a more reliable evaluation for models because of their high quality and
consistent annotations. Besides, they are more suitable for cross-domain
settings. Our experimental results show that in the cross-domain setting,
benchmark models trained with the cleaned Open-Tables dataset can achieve
0.6%-2.6% higher weighted average F1 than the corresponding ones trained with
the noisy version of Open-Tables, demonstrating the reliability of the proposed
datasets. The datasets are publicly available.
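The abstract reports a "weighted average F1" without defining it. As a minimal sketch, assuming the IoU-weighted definition commonly used in table detection benchmarks (each F1 score is weighted by the IoU threshold at which it was measured), the metric could be computed as follows. The function name, the thresholds, and the scores are illustrative assumptions, not values from the paper.

```python
from typing import Mapping


def weighted_average_f1(f1_by_iou: Mapping[float, float]) -> float:
    """Average F1 scores across IoU thresholds, weighting each by its threshold.

    Assumed definition: sum(iou * f1) / sum(iou). The paper's exact
    formula may differ.
    """
    if not f1_by_iou:
        raise ValueError("need at least one (IoU threshold, F1) pair")
    total_weight = sum(f1_by_iou)  # iterating a mapping yields its keys (the thresholds)
    return sum(iou * f1 for iou, f1 in f1_by_iou.items()) / total_weight


# Made-up F1 scores for illustration only; these are not results from the paper.
example_scores = {0.6: 0.95, 0.7: 0.93, 0.8: 0.90, 0.9: 0.80}
wavg = weighted_average_f1(example_scores)
print(f"weighted average F1: {wavg:.3f}")
```

Under this weighting, scores at stricter IoU thresholds count more, so a model with poorly localized boxes is penalized more heavily than under a plain mean.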
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts [0.0]
This paper introduces the MERIT dataset, a fully labeled dataset within the context of school reports.
By its nature, the MERIT dataset can potentially include biases in a controlled way, making it a valuable tool to benchmark biases induced in Language Models (LLMs).
To demonstrate the dataset's utility, we present a benchmark with token classification models, showing that the dataset poses a significant challenge even for SOTA models.
arXiv Detail & Related papers (2024-08-31T12:56:38Z)
- A Language Model-Guided Framework for Mining Time Series with Distributional Shifts [5.082311792764403]
This paper presents an approach that utilizes large language models and data source interfaces to explore and collect time series datasets.
While obtained from external sources, the collected data share critical statistical properties with primary time series datasets.
It suggests that collected datasets can effectively supplement existing datasets, especially involving changes in data distribution.
arXiv Detail & Related papers (2024-06-07T20:21:07Z)
- RanLayNet: A Dataset for Document Layout Detection used for Domain Adaptation and Generalization [36.973388673687815]
RanLayNet is a synthetic document dataset enriched with automatically assigned labels.
We show that a deep layout identification model trained on our dataset exhibits enhanced performance compared to a model trained solely on actual documents.
arXiv Detail & Related papers (2024-04-15T07:50:15Z)
- UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
- On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets [71.54954966652286]
We try to evaluate the Vision-Language Instruction-Tuning (VLIT) datasets.
We build a new dataset, REVO-LION, by collecting samples with higher SQ from each dataset.
Remarkably, even with only half of the complete data, the model trained on REVO-LION can achieve performance comparable to simply adding all VLIT datasets up.
arXiv Detail & Related papers (2023-10-10T13:01:38Z)
- dacl1k: Real-World Bridge Damage Dataset Putting Open-Source Data to the Test [0.6827423171182154]
"dacl1k" is a multi-label RCD dataset for multi-label classification based on building inspections including 1,474 images.
We trained the models on different combinations of open-source data (meta datasets) which were subsequently evaluated both extrinsically and intrinsically.
The performance analysis on dacl1k shows practical usability of the meta data, where the best model shows an Exact Match Ratio of 32%.
arXiv Detail & Related papers (2023-09-07T15:05:35Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.