MexPub: Deep Transfer Learning for Metadata Extraction from German
Publications
- URL: http://arxiv.org/abs/2106.07359v1
- Date: Fri, 4 Jun 2021 09:43:48 GMT
- Title: MexPub: Deep Transfer Learning for Metadata Extraction from German
Publications
- Authors: Zeyd Boukhers and Nada Beili and Timo Hartmann and Prantik Goswami and
Muhammad Arslan Zafar
- Abstract summary: We present a method that extracts metadata from PDF documents with different layouts and styles by viewing the document as an image.
Our method achieved an average accuracy of around $90\%$, which validates its capability to accurately extract metadata from a variety of PDF documents.
- Score: 1.1549572298362785
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Extracting metadata from scientific papers can be considered a solved problem
in NLP due to the high accuracy of state-of-the-art methods. However, this does
not apply to German scientific publications, which have a variety of styles and
layouts. In contrast to most of the English scientific publications that follow
standard and simple layouts, the order, content, position, and size of metadata
in German publications vary greatly from one publication to another. This variety
causes traditional NLP methods to fail to accurately extract metadata from these
publications. In this paper, we present a method that extracts metadata from
PDF documents with different layouts and styles by viewing the document as an
image. We used Mask R-CNN, pre-trained on the COCO dataset and fine-tuned on the
PubLayNet dataset, which consists of ~200K PDF snapshots with five basic layout
classes (e.g., text, figure). We then further fine-tuned the model on our proposed
synthetic dataset of ~30K article snapshots to extract nine metadata patterns
(e.g., author, title). Our synthetic dataset is generated from content in both
German and English, using a finite set of challenging templates obtained from
German publications. Our method achieved an average accuracy of around $90\%$,
which validates its capability to accurately extract metadata from a variety of
PDF documents with challenging templates.
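To make the training pipeline above concrete, the following is a minimal sketch, not the authors' released code, of how a COCO-pre-trained Mask R-CNN can be adapted to detect custom metadata regions using torchvision. The full nine-class label list and all hyperparameters are illustrative assumptions, since the abstract only names author and title.

# A minimal sketch, assuming torchvision's Mask R-CNN implementation; NOT the
# authors' code. It replaces the COCO box and mask heads so the model predicts
# metadata regions on page snapshots.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

# Assumed label set: the paper names nine patterns but lists only "author, title",
# so the remaining class names are illustrative placeholders.
METADATA_CLASSES = [
    "author", "title", "abstract", "journal", "affiliation",
    "date", "doi", "address", "email",
]
NUM_CLASSES = len(METADATA_CLASSES) + 1  # +1 for the background class

def build_model(num_classes: int) -> torch.nn.Module:
    # Mask R-CNN with a ResNet-50 FPN backbone, pre-trained on COCO.
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    # Swap the box classification head for the new label set.
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # Swap the mask prediction head as well.
    in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)
    return model

model = build_model(NUM_CLASSES)
# Training would iterate over page snapshots rendered from PDFs, passing
# (images, targets) pairs with "boxes", "labels", and "masks" to the model in
# train mode and optimizing the sum of the returned losses -- first on
# PubLayNet-style layout labels, then on the synthetic nine-class dataset.
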
Related papers
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information.
Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task.
We introduce PDF-WuKong, a multimodal large language model (MLLM) which is designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
Our dataset is 15 times larger in scale while maintaining good data quality.
We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z)
- Recovering document annotations for sentence-level bitext [18.862295675088056]
We reconstruct document-level information for three datasets in German, French, Spanish, Italian, Polish, and Portuguese.
We introduce a document-level filtering technique as an alternative to traditional bitext filtering.
Lastly, we train models on these longer contexts and demonstrate improvements in document-level translation without degrading sentence-level translation.
arXiv Detail & Related papers (2024-06-06T08:58:14Z)
- Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines [1.174020933567308]
Kurdish libraries have many historical publications that were printed back in the early days when printing devices were brought to Kurdistan.
Current Optical Character Recognition (OCR) systems are unable to extract text from historical documents, as these documents present many issues.
In this study, we adopt an open-source OCR framework by Google, Tesseract version 5.0, which has been used to extract text for various languages.
arXiv Detail & Related papers (2024-04-09T08:08:03Z)
- PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents [4.191058827240492]
We present the Printed English and Chemical Equations (PEaCE) dataset, containing both synthetic and real-world records.
We evaluate the efficacy of transformer-based OCR models when trained on this resource.
arXiv Detail & Related papers (2024-03-23T05:20:36Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- On Search Strategies for Document-Level Neural Machine Translation [51.359400776242786]
Document-level neural machine translation (NMT) models produce a more consistent output across a document.
In this work, we aim to answer the question of how best to utilize a context-aware translation model in decoding.
arXiv Detail & Related papers (2023-06-08T11:30:43Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Multimodal Approach for Metadata Extraction from German Scientific Publications [0.0]
We propose a multimodal deep learning approach for metadata extraction from scientific papers in the German language.
We consider multiple types of input data by combining natural language processing and image vision processing.
Our model for this approach was trained on a dataset consisting of around 8800 documents and is able to obtain an overall F1-score of 0.923.
arXiv Detail & Related papers (2021-11-10T15:19:04Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations [3.1354625918296612]
Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks.
Traditional sequence tagging methods mainly rely on text-based features.
We propose a conditional random field (CRF) model that combines text-based and visual features.
arXiv Detail & Related papers (2021-07-01T14:59:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.