A Large Dataset of Historical Japanese Documents with Complex Layouts
- URL: http://arxiv.org/abs/2004.08686v1
- Date: Sat, 18 Apr 2020 18:38:25 GMT
- Title: A Large Dataset of Historical Japanese Documents with Complex Layouts
- Authors: Zejiang Shen, Kaixuan Zhang, Melissa Dell
- Abstract summary: HJDataset is a large dataset of historical Japanese documents with complex layouts.
It contains over 250,000 layout element annotations of seven types.
A semi-rule-based method is developed to extract the layout elements, and the results are checked by human inspectors.
- Score: 5.343406649012619
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning-based approaches for automatic document layout analysis and
content extraction have the potential to unlock rich information trapped in
historical documents on a large scale. One major hurdle is the lack of large
datasets for training robust models. In particular, little training data exist
for Asian languages. To this end, we present HJDataset, a Large Dataset of
Historical Japanese Documents with Complex Layouts. It contains over 250,000
layout element annotations of seven types. In addition to bounding boxes and
masks of the content regions, it also includes the hierarchical structures and
reading orders for layout elements. The dataset is constructed using a
combination of human and machine efforts. A semi-rule-based method is developed to extract the layout elements, and the results are checked by human
inspectors. The resulting large-scale dataset is used to provide baseline
performance analyses for text region detection using state-of-the-art deep
learning models. We also demonstrate the usefulness of the dataset on real-world
document digitization tasks. The dataset is available at
https://dell-research-harvard.github.io/HJDataset/.
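For readers who want to work with annotations of this kind, here is a minimal sketch of iterating over COCO-style layout annotations. The file name and the assumption that HJDataset's JSON follows the standard COCO fields are illustrative, not confirmed here; consult the project page for the actual format.

```python
import json
from collections import defaultdict

# Minimal sketch: reading HJDataset-style layout annotations.
# Assumes a COCO-style JSON file; "hjdataset/train.json" is an
# illustrative path, not the dataset's confirmed file name.
with open("hjdataset/train.json") as f:
    coco = json.load(f)

# Map category ids to the layout element types (seven in HJDataset).
categories = {c["id"]: c["name"] for c in coco["categories"]}

# Group annotations by page (image) to recover per-page layouts.
by_page = defaultdict(list)
for ann in coco["annotations"]:
    by_page[ann["image_id"]].append(ann)

for image in coco["images"][:3]:
    elements = by_page[image["id"]]
    print(image["file_name"], len(elements), "layout elements")
    for ann in elements:
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, width, height]
        print(f"  {categories[ann['category_id']]}: ({x}, {y}, {w}, {h})")
```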
Related papers
- Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
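As a rough illustration of how such a typicality measure could be computed, the sketch below scores an image by the average reduction in denoising loss when conditioning on a label versus a null condition. The noise-prediction interface `model(x_t, t, cond)` and the `alpha_bars` schedule are assumptions for the sketch, not the paper's actual code.

```python
import torch

def typicality(model, image, label_emb, null_emb, alpha_bars, n_samples=8):
    """Estimate how typical `image` is for a label: conditioning on the
    label should reduce the denoising error more for typical images.

    `model(x_t, t, cond)` (noise prediction) and `alpha_bars` (cumulative
    noise schedule) are assumed interfaces, not taken from the paper.
    """
    deltas = []
    for _ in range(n_samples):
        t = torch.randint(len(alpha_bars), (1,)).item()
        noise = torch.randn_like(image)
        a = alpha_bars[t]
        # Standard DDPM forward noising of the input image.
        x_t = a.sqrt() * image + (1 - a).sqrt() * noise
        loss_cond = torch.mean((model(x_t, t, label_emb) - noise) ** 2)
        loss_null = torch.mean((model(x_t, t, null_emb) - noise) ** 2)
        deltas.append((loss_null - loss_cond).item())
    # Positive values mean the label makes the image easier to denoise,
    # i.e. the image is typical of that label.
    return sum(deltas) / n_samples
```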
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
- Prompt me a Dataset: An investigation of text-image prompting for historical image dataset creation using foundation models [0.9065034043031668]
We present a pipeline for image extraction from historical documents using foundation models.
We evaluate text-image prompts and their effectiveness on humanities datasets of varying levels of complexity.
arXiv Detail & Related papers (2023-09-04T15:37:03Z)
- The mapKurator System: A Complete Pipeline for Extracting and Linking Text from Historical Maps [7.209761597734092]
mapKurator is an end-to-end system integrating machine learning models with a comprehensive data processing pipeline.
We deployed the mapKurator system and enabled the processing of over 60,000 maps and over 100 million text/place names in the David Rumsey Historical Map collection.
arXiv Detail & Related papers (2023-06-29T16:05:40Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents (VRDs).
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- Datasets: A Community Library for Natural Language Processing [55.48866401721244]
Datasets is a community library for contemporary NLP.
The library includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects.
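For context, the library's core entry point is load_dataset; the minimal sketch below loads a hosted dataset and applies a simple transformation. The dataset name "imdb" is just an example identifier; any dataset on the Hugging Face Hub works the same way.

```python
from datasets import load_dataset

# Load a dataset from the Hugging Face Hub by name.
ds = load_dataset("imdb", split="train")

print(ds)     # features and number of rows
print(ds[0])  # first example as a plain dict

# Datasets are transformed with map(); here we add a derived column.
lengths = ds.map(lambda ex: {"n_chars": len(ex["text"])})
print(lengths[0]["n_chars"])
```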
arXiv Detail & Related papers (2021-09-07T03:59:22Z)
- SDL: New data generation tools for full-level annotated document layout [0.0]
We present a novel data generation tool for document processing.
The tool focuses on providing a maximal level of visual information in a standard document.
It also enables working with a large dataset on low-resource languages.
arXiv Detail & Related papers (2021-06-29T06:32:31Z)
- docExtractor: An off-the-shelf historical document element extraction [18.828438308738495]
We present docExtractor, a generic approach for extracting visual elements such as text lines or illustrations from historical documents.
We demonstrate that it provides high-quality performance as an off-the-shelf system across a wide variety of datasets.
We introduce a new public dataset dubbed IlluHisDoc dedicated to the fine-grained evaluation of illustration segmentation in historical documents.
arXiv Detail & Related papers (2020-12-15T10:19:18Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and the results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal [10.553314461761968]
Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries.
This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters.
arXiv Detail & Related papers (2020-05-20T14:33:33Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.