ScanBank: A Benchmark Dataset for Figure Extraction from Scanned
Electronic Theses and Dissertations
- URL: http://arxiv.org/abs/2106.15320v1
- Date: Wed, 23 Jun 2021 04:43:56 GMT
- Title: ScanBank: A Benchmark Dataset for Figure Extraction from Scanned
Electronic Theses and Dissertations
- Authors: Sampanna Yashwant Kahu, William A. Ingram, Edward A. Fox, Jian Wu
- Abstract summary: We focus on electronic theses and dissertations (ETDs), aiming to improve access and expand their utility.
Although methods have been proposed for extracting figures and tables from born-digital PDFs, they do not work well with scanned ETDs.
To address this limitation, we present ScanBank, a new dataset containing 10 thousand scanned page images.
We use this dataset to train a deep neural network model based on YOLOv5 to accurately extract figures and tables from scanned ETDs.
- Score: 3.4252676314771144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We focus on electronic theses and dissertations (ETDs), aiming to improve
access and expand their utility, since more than 6 million are publicly
available, and they constitute an important corpus to aid research and
education across disciplines. The corpus continues to grow as new born-digital
documents are added and as millions of older theses and dissertations are
converted to digital form for electronic dissemination in
institutional repositories. In ETDs, as with other scholarly works, figures and
tables can communicate a large amount of information in a concise way. Although
methods have been proposed for extracting figures and tables from born-digital
PDFs, they do not work well with scanned ETDs. Considering this problem, our
assessment of state-of-the-art figure extraction systems is that the reason
they do not function well on scanned PDFs is that they have only been trained
on born-digital documents. To address this limitation, we present ScanBank, a
new dataset containing 10 thousand scanned page images, manually labeled by
humans as to the presence of the 3.3 thousand figures or tables found therein.
We use this dataset to train a deep neural network model based on YOLOv5 to
accurately extract figures and tables from scanned ETDs. We pose and answer
important research questions aimed at finding better methods for figure
extraction from scanned documents. One of these concerns the value, for
training, of data augmentation techniques applied to born-digital documents,
so that the resulting models are better suited for figure extraction from scanned
documents. To the best of our knowledge, ScanBank is the first manually
annotated dataset for figure and table extraction for scanned ETDs. A
YOLOv5-based model, trained on ScanBank, outperforms existing comparable
open-source and freely available baseline methods by a considerable margin.
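
The abstract does not spell out how the YOLOv5-based detector is applied, so the following is only a rough sketch of how such a model, fine-tuned on ScanBank-style page images, could be loaded through the standard ultralytics/yolov5 torch.hub entry point and run on one scanned page. The checkpoint filename, input image, and class names are hypothetical placeholders, not artifacts released with the paper.

```python
# Minimal sketch (not the authors' release): run a YOLOv5 checkpoint fine-tuned
# for figure/table detection on a scanned ETD page.
import torch

# Load a custom-trained YOLOv5 checkpoint via the ultralytics/yolov5 hub entry point.
model = torch.hub.load("ultralytics/yolov5", "custom", path="scanbank_yolov5.pt")  # hypothetical weights

# YOLOv5's AutoShape wrapper accepts file paths, PIL images, or numpy arrays.
results = model("scanned_etd_page.png")  # hypothetical page image

# Each detection row is [x1, y1, x2, y2, confidence, class_index].
for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
    label = model.names[int(cls)]  # e.g. "figure" or "table", assuming those class names
    print(f"{label}: ({x1:.0f}, {y1:.0f}) -> ({x2:.0f}, {y2:.0f}), conf={conf:.2f}")
```

The detected boxes can then be cropped from the page image to obtain the extracted figures and tables.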
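The research question about data augmentation asks whether born-digital pages, degraded to look like scans, help train models that transfer to scanned documents. The sketch below illustrates that idea under assumed transforms (grayscale conversion, blur, additive noise, slight skew); the parameters and transform choices are illustrative assumptions, not the paper's recipe.

```python
# Minimal sketch, assuming a simple "scan-like" degradation of born-digital page renders.
import numpy as np
from PIL import Image, ImageFilter

def simulate_scan(page, blur_radius=1.0, noise_sigma=8.0, max_skew_deg=1.5):
    """Degrade a clean, born-digital page render so it resembles a scanned page."""
    img = page.convert("L")                                   # scanned ETD pages are often grayscale
    img = img.filter(ImageFilter.GaussianBlur(blur_radius))   # optical blur from scanning
    arr = np.asarray(img, dtype=np.float32)
    arr += np.random.normal(0.0, noise_sigma, arr.shape)      # sensor / paper noise
    img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    skew = float(np.random.uniform(-max_skew_deg, max_skew_deg))
    return img.rotate(skew, fillcolor=255)                    # slight page misalignment

if __name__ == "__main__":
    page = Image.open("born_digital_page.png")   # hypothetical render of a born-digital PDF page
    simulate_scan(page).save("scan_like_page.png")
```

Pages degraded this way, paired with their original figure and table bounding boxes, could serve as additional training data for a YOLOv5-style detector like the one sketched above.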
Related papers
- A Novel Dataset for Non-Destructive Inspection of Handwritten Documents [0.0]
Forensic handwriting examination analyzes handwritten documents in order to identify or hypothesize the manuscript's author.
We propose a new and challenging dataset consisting of two subsets: the first consists of 21 documents written either with the classic "pen and paper" approach (and later digitized) or acquired directly on common devices such as tablets.
Preliminary results on the proposed datasets show that 90% classification accuracy can be achieved on the first subset.
arXiv Detail & Related papers (2024-01-09T09:25:58Z)
- An Innovative Tool for Uploading/Scraping Large Image Datasets on Social Networks [9.27070946719462]
We propose an automated approach based on a purpose-built digital tool.
The tool is capable of automatically uploading an entire image dataset to the desired digital platform and then downloading all the uploaded pictures.
arXiv Detail & Related papers (2023-11-01T23:27:37Z)
- Unveiling Document Structures with YOLOv5 Layout Detection [0.0]
This research investigates the utilization of YOLOv5, a cutting-edge computer vision model, for the purpose of rapidly identifying document layouts and extracting unstructured data.
The main objective is to create an autonomous system that can effectively recognize document layouts and extract unstructured data.
arXiv Detail & Related papers (2023-09-29T07:45:10Z)
- SCoDA: Domain Adaptive Shape Completion for Real Scans [78.92028595499245]
3D shape completion from point clouds is a challenging task, especially from scans of real-world objects.
We propose a new task, SCoDA, for the domain adaptation of real scan shape completion from synthetic data.
We propose a novel cross-domain feature fusion method for knowledge transfer and a novel volume-consistent self-training framework for robust learning from real data.
arXiv Detail & Related papers (2023-04-20T09:38:26Z)
- Data-Free Sketch-Based Image Retrieval [56.96186184599313]
We propose Data-Free (DF)-SBIR, where pre-trained, single-modality classification models have to be leveraged to learn a cross-modal metric space for retrieval without access to any training data.
We present a methodology for DF-SBIR, which can leverage knowledge from models independently trained to perform classification on photos and sketches.
Our method also achieves mAPs competitive with data-dependent approaches, all the while requiring no training data.
arXiv Detail & Related papers (2023-03-14T10:34:07Z)
- DiT: Self-supervised Pre-training for Document Image Transformer [85.78807512344463]
We propose DiT, a self-supervised pre-trained Document Image Transformer model.
We leverage DiT as the backbone network in a variety of vision-based Document AI tasks.
Experimental results show that the self-supervised pre-trained DiT model achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-03-04T15:34:46Z)
- DeepShovel: An Online Collaborative Platform for Data Extraction in Geoscience Literature with AI Assistance [48.55345030503826]
Geoscientists need to read a huge amount of literature to locate, extract, and aggregate relevant results and data.
DeepShovel is a publicly-available AI-assisted data extraction system to support their needs.
A follow-up user evaluation with 14 researchers suggested DeepShovel improved users' efficiency of data extraction for building scientific databases.
arXiv Detail & Related papers (2022-02-21T12:18:08Z)
- Towards Deep Learning-based 6D Bin Pose Estimation in 3D Scans [0.0]
This paper focuses on a specific task of 6D pose estimation of a bin in 3D scans.
We present a high-quality dataset composed of synthetic data and real scans captured by a structured-light scanner with precise annotations.
arXiv Detail & Related papers (2021-12-17T16:19:06Z)
- DocScanner: Robust Document Image Rectification with Progressive Learning [162.03694280524084]
This work presents DocScanner, a new deep network architecture for document image rectification.
DocScanner maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture.
The iterative refinements make DocScanner converge to a robust and superior performance, and the lightweight recurrent architecture ensures the running efficiency.
arXiv Detail & Related papers (2021-10-28T09:15:02Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task need to feed a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations [3.1354625918296612]
Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks.
Traditional sequence tagging methods mainly rely on text-based features.
We propose a conditional random field (CRF) model that combines text-based and visual features.
arXiv Detail & Related papers (2021-07-01T14:59:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality or accuracy of the information presented and is not responsible for any consequences of its use.