Automatic Metadata Extraction Incorporating Visual Features from Scanned
Electronic Theses and Dissertations
- URL: http://arxiv.org/abs/2107.00516v1
- Date: Thu, 1 Jul 2021 14:59:18 GMT
- Title: Automatic Metadata Extraction Incorporating Visual Features from Scanned
Electronic Theses and Dissertations
- Authors: Muntabir Hasan Choudhury, Himarsha R. Jayanetti, Jian Wu, William A.
Ingram, Edward A. Fox
- Abstract summary: Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks.
Traditional sequence tagging methods mainly rely on text-based features.
We propose a conditional random field (CRF) model that combines text-based and visual features.
- Score: 3.1354625918296612
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Electronic Theses and Dissertations (ETDs) contain domain knowledge that can
be used for many digital library tasks, such as analyzing citation networks and
predicting research trends. Automatic metadata extraction is important to build
scalable digital library search engines. Most existing methods are designed for
born-digital documents, so they often fail to extract metadata from scanned
documents such as for ETDs. Traditional sequence tagging methods mainly rely on
text-based features. In this paper, we propose a conditional random field (CRF)
model that combines text-based and visual features. To verify the robustness of
our model, we extended an existing corpus and created a new ground truth corpus
consisting of 500 ETD cover pages with human validated metadata. Our
experiments show that CRF with visual features outperformed both a heuristic
and a CRF model with only text-based features. The proposed model achieved
81.3%-96% F1 measure on seven metadata fields. The data and source code are
publicly available on Google Drive (https://tinyurl.com/y8kxzwrp) and a GitHub
repository (https://github.com/lamps-lab/ETDMiner/tree/master/etd_crf),
respectively.
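To make the approach more concrete, here is a minimal sketch of a sequence-tagging CRF that mixes text-based and layout-derived features, using sklearn-crfsuite. The feature names, toy cover-page data, and hyperparameters are illustrative assumptions; the released code in the GitHub repository above is the authoritative implementation.

```python
# Hedged sketch: the feature keys (font_size, y, x_center, ...) are hypothetical
# stand-ins for the paper's text + visual features, not the released ETDMiner code.
import sklearn_crfsuite

def line_features(line):
    """Text and layout features for one cover-page line."""
    return {
        "lower": line["text"].lower(),
        "is_upper": line["text"].isupper(),
        "has_digit": any(c.isdigit() for c in line["text"]),
        # Visual cues from the scanned page layout (assumed keys).
        "font_size": line["font_size"],
        "y_position": round(line["y"] / line["page_height"], 1),
        "is_centered": abs(line["x_center"] - 0.5) < 0.1,
    }

# Each training example is one cover page: a sequence of lines labeled with
# tags such as B-title, I-title, B-author, B-year, ...
pages = [[{"text": "A STUDY OF METADATA EXTRACTION", "font_size": 22,
           "y": 120, "page_height": 1100, "x_center": 0.5}]]
labels = [["B-title"]]

X = [[line_features(line) for line in page] for page in pages]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, labels)
print(crf.predict(X))
```

Treating each cover page as one sequence lets the CRF exploit label transitions (for example, title lines tending to precede author lines), which is where it gains over per-line classifiers.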
Related papers
- CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets.
We use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents.
We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks.
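As a rough illustration of the retrieval step, the sketch below ranks a tiny hand-made corpus against a seed task description using TF-IDF cosine similarity; the corpus, seed text, and the choice of TF-IDF (rather than the paper's retrieval setup) are assumptions.

```python
# Minimal similarity-based document retrieval; TF-IDF stands in for the
# embedding-based retrieval over web-crawled corpora described in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "A tutorial on training conditional random fields for sequence tagging.",
    "Recipe: how to bake sourdough bread at home.",
    "Extracting titles and authors from scanned thesis cover pages.",
]
seed = "metadata extraction from scanned documents"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)
seed_vector = vectorizer.transform([seed])

# Rank corpus documents by similarity to the few-shot seed description.
scores = cosine_similarity(seed_vector, doc_vectors).ravel()
for i in scores.argsort()[::-1][:2]:
    print(round(float(scores[i]), 3), corpus[i])
```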
arXiv Detail & Related papers (2024-09-03T17:54:40Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
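A minimal sketch of adversarial training on input embeddings is shown below; the FGSM-style perturbation, the toy classifier, and the dimensions are illustrative stand-ins, not the adversarial algorithm proposed in the paper.

```python
# Hedged sketch: perturb claim/evidence embeddings in the loss-increasing
# direction and train on clean + perturbed inputs. All shapes are arbitrary.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def adversarial_step(embeddings, labels, epsilon=0.01):
    """One update on clean plus FGSM-perturbed embeddings."""
    embeddings = embeddings.clone().requires_grad_(True)
    clean_loss = loss_fn(model(embeddings), labels)
    # Gradient w.r.t. the inputs gives the locally worst-case direction.
    grad, = torch.autograd.grad(clean_loss, embeddings, retain_graph=True)
    adv_inputs = (embeddings + epsilon * grad.sign()).detach()
    total_loss = clean_loss + loss_fn(model(adv_inputs), labels)
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return float(total_loss)

# Random stand-in data with binary support/refute labels.
print(adversarial_step(torch.randn(16, 768), torch.randint(0, 2, (16,))))
```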
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
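For context, a generic text-to-image retrieval baseline built on an off-the-shelf CLIP checkpoint from sentence-transformers is sketched below; it is not the dual Transformer proposed in the paper, and the gallery file names are placeholders.

```python
# Hedged sketch: rank a small image gallery against a textual person
# description using a shared CLIP embedding space.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")   # embeds both images and text

gallery_paths = ["person_001.jpg", "person_002.jpg"]   # placeholder files
gallery_emb = model.encode([Image.open(p) for p in gallery_paths])
query_emb = model.encode(["a woman in a red coat carrying a black backpack"])

scores = util.cos_sim(query_emb, gallery_emb)[0]
best = int(scores.argmax())
print(gallery_paths[best], float(scores[best]))
```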
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- Interactive Distillation of Large Single-Topic Corpora of Scientific Papers [1.2954493726326113]
A more robust but time-consuming approach is to build the dataset constructively, with a subject matter expert handpicking the documents.
Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature.
arXiv Detail & Related papers (2023-09-19T17:18:36Z)
- Geometric Perception based Efficient Text Recognition [0.0]
In real-world applications with fixed camera positions, the underlying data tends to be regular scene text.
This paper introduces the underlying concepts, theory, implementation, and experiment results to develop specialized models.
We introduce a novel deep learning architecture (GeoTRNet), trained to identify digits in a regular scene image, only using the geometrical features present.
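The sketch below is a generic stand-in rather than the GeoTRNet architecture: a small CNN that classifies digits from a binary edge map instead of raw pixels, to convey the idea of relying only on geometric cues; all layer sizes are arbitrary.

```python
# Hedged sketch: digit classification from edge maps (shape only, no texture).
import torch
import torch.nn as nn

class EdgeDigitNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, 10)   # 10 digit classes

    def forward(self, edge_map):   # edge_map: (N, 1, 32, 32), values in {0, 1}
        return self.classifier(self.features(edge_map).flatten(1))

model = EdgeDigitNet()
print(model(torch.rand(4, 1, 32, 32)).shape)   # torch.Size([4, 10])
```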
arXiv Detail & Related papers (2023-02-08T04:19:24Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers.
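A toy illustration of the ngram-as-identifier idea follows: any substring of a passage can act as a key that maps back to the documents containing it. The dictionary lookup only sketches the retrieval side, not the constrained autoregressive decoding used in the paper; the documents are made up.

```python
# Hedged sketch: index every word ngram of each passage as a possible
# document identifier.
from collections import defaultdict

docs = {
    "d1": "electronic theses and dissertations contain domain knowledge",
    "d2": "metadata extraction from scanned documents is hard",
}

def ngram_index(docs, max_n=3):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                index[" ".join(tokens[i:i + n])].add(doc_id)
    return index

index = ngram_index(docs)
# A generated ngram doubles as the identifier of the documents containing it.
print(index["scanned documents"])   # {'d2'}
print(index["domain knowledge"])    # {'d1'}
```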
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- Multimodal Approach for Metadata Extraction from German Scientific Publications [0.0]
We propose a multimodal deep learning approach for metadata extraction from scientific papers in the German language.
We consider multiple types of input data by combining natural language processing and image vision processing.
Our model for this approach was trained on a dataset consisting of around 8800 documents and is able to obtain an overall F1-score of 0.923.
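As a rough sketch of combining the two modalities, the snippet below concatenates a text embedding with visual features for a page region and classifies the metadata field; the dimensions, field set, and simple late-fusion design are assumptions rather than the paper's architecture.

```python
# Hedged sketch: late fusion of text and visual features for field classification.
import torch
import torch.nn as nn

NUM_FIELDS = 5              # e.g. title, author, date, affiliation, other
TEXT_DIM, VISUAL_DIM = 768, 64

class LateFusionClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(TEXT_DIM + VISUAL_DIM, 256),
            nn.ReLU(),
            nn.Linear(256, NUM_FIELDS),
        )

    def forward(self, text_emb, visual_feats):
        return self.fuse(torch.cat([text_emb, visual_feats], dim=-1))

model = LateFusionClassifier()
logits = model(torch.randn(8, TEXT_DIM), torch.randn(8, VISUAL_DIM))
print(logits.shape)   # torch.Size([8, 5])
```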
arXiv Detail & Related papers (2021-11-10T15:19:04Z)
- ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations [3.4252676314771144]
We focus on electronic theses and dissertations (ETDs), aiming to improve access and expand their utility.
Although methods have been proposed for extracting figures and tables from born-digital PDFs, they do not work well with scanned ETDs.
To address this limitation, we present ScanBank, a new dataset containing 10 thousand scanned page images.
We use this dataset to train a deep neural network model based on YOLOv5 to accurately extract figures and tables from scanned ETDs.
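Running an off-the-shelf YOLOv5 detector over a page image via torch.hub looks roughly like the sketch below; the generic yolov5s weights and the image path are placeholders, since the ScanBank model is trained on the paper's own figure and table annotations.

```python
# Hedged sketch: generic YOLOv5 inference, not the ScanBank-trained weights.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
results = model("scanned_page.png")   # path to a page image (assumed to exist)

# Each detection row: x1, y1, x2, y2, confidence, class index.
for *box, conf, cls in results.xyxy[0].tolist():
    print(model.names[int(cls)], round(conf, 2), [round(v, 1) for v in box])
```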
arXiv Detail & Related papers (2021-06-23T04:43:56Z)
- Key Information Extraction From Documents: Evaluation And Generator [3.878105750489656]
This research project compares state-of-the-art models for information extraction from documents.
The results have shown that NLP based pre-processing is beneficial for model performance.
The use of a bounding box regression decoder increases the model performance only for fields that do not follow a rectangular shape.
arXiv Detail & Related papers (2021-06-09T16:12:21Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.