A Large Dataset of Historical Japanese Documents with Complex Layouts
- URL: http://arxiv.org/abs/2004.08686v1
- Date: Sat, 18 Apr 2020 18:38:25 GMT
- Title: A Large Dataset of Historical Japanese Documents with Complex Layouts
- Authors: Zejiang Shen, Kaixuan Zhang, Melissa Dell
- Abstract summary: HJDataset is a large dataset of historical Japanese documents with complex layouts.
It contains over 250,000 layout element annotations of seven types.
A semi-rule based method is developed to extract the layout elements, and the results are checked by human inspectors.
- Score: 5.343406649012619
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning-based approaches for automatic document layout analysis and
content extraction have the potential to unlock rich information trapped in
historical documents on a large scale. One major hurdle is the lack of large
datasets for training robust models. In particular, little training data exist
for Asian languages. To this end, we present HJDataset, a Large Dataset of
Historical Japanese Documents with Complex Layouts. It contains over 250,000
layout element annotations of seven types. In addition to bounding boxes and
masks of the content regions, it also includes the hierarchical structures and
reading orders for layout elements. The dataset is constructed using a
combination of human and machine efforts. A semi-rule based method is developed
to extract the layout elements, and the results are checked by human
inspectors. The resulting large-scale dataset is used to provide baseline
performance analyses for text region detection using state-of-the-art deep
learning models. And we demonstrate the usefulness of the dataset on real-world
document digitization tasks. The dataset is available at
https://dell-research-harvard.github.io/HJDataset/.
Related papers
- Diachronic Document Dataset for Semantic Layout Analysis [9.145289299764991]
This dataset includes 7,254 annotated pages spanning a large temporal range (1600-2024) of digitised and born-digital materials.
By incorporating content from different periods and genres, it addresses varying layout complexities and historical changes in document structure.
We evaluate object detection models on this dataset, examining the impact of input size and subset-based training.
arXiv Detail & Related papers (2024-11-15T09:33:13Z)
- Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
- Prompt me a Dataset: An investigation of text-image prompting for historical image dataset creation using foundation models [0.9065034043031668]
We present a pipeline for image extraction from historical documents using foundation models.
We evaluate text-image prompts and their effectiveness on humanities datasets of varying levels of complexity.
arXiv Detail & Related papers (2023-09-04T15:37:03Z)
- The mapKurator System: A Complete Pipeline for Extracting and Linking Text from Historical Maps [7.209761597734092]
mapKurator is an end-to-end system integrating machine learning models with a comprehensive data processing pipeline.
We deployed the mapKurator system and enabled the processing of over 60,000 maps and over 100 million text/place names in the David Rumsey Historical Map collection.
arXiv Detail & Related papers (2023-06-29T16:05:40Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- Datasets: A Community Library for Natural Language Processing [55.48866401721244]
datasets is a community library for contemporary NLP.
The library includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects.
arXiv Detail & Related papers (2021-09-07T03:59:22Z)
- SDL: New data generation tools for full-level annotated document layout [0.0]
We present a novel data generation tool for document processing.
The tool focuses on providing a maximal level of visual information in a normal type document.
It also enables working with a large dataset on low-resource languages.
arXiv Detail & Related papers (2021-06-29T06:32:31Z)
- docExtractor: An off-the-shelf historical document element extraction [18.828438308738495]
We present docExtractor, a generic approach for extracting visual elements such as text lines or illustrations from historical documents.
We demonstrate it provides high-quality performances as an off-the-shelf system across a wide variety of datasets.
We introduce a new public dataset dubbed IlluHisDoc dedicated to the fine evaluation of illustration segmentation in historical documents.
arXiv Detail & Related papers (2020-12-15T10:19:18Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.