LAME: Layout Aware Metadata Extraction Approach for Research Articles
- URL: http://arxiv.org/abs/2112.12353v1
- Date: Thu, 23 Dec 2021 04:23:08 GMT
- Title: LAME: Layout Aware Metadata Extraction Approach for Research Articles
- Authors: Jongyun Choi, Hyesoo Kong, Hwamook Yoon, Heung-Seon Oh, Yuchul Jung
- Abstract summary: The volume of academic literature, such as academic conference papers and journals, has increased rapidly worldwide.
High-performing metadata extraction is still challenging due to the diverse layout formats used by journal publishers.
We propose a novel LAyout-aware Metadata Extraction (LAME) framework equipped with three characteristics.
- Score: 1.8899300124593648
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The volume of academic literature, such as academic conference papers and
journals, has increased rapidly worldwide, and research on metadata extraction
is ongoing. However, high-performing metadata extraction is still challenging
due to the diverse layout formats used by journal publishers. To accommodate
the diversity of academic journal layouts, we propose a novel LAyout-aware
Metadata Extraction (LAME) framework equipped with three characteristics:
design of an automatic layout analysis, construction of a large metadata
training set, and construction of Layout-MetaBERT. We designed the automatic
layout analysis using PDFMiner. Based on the layout analysis, a large volume of
metadata-separated training data, including the title, abstract, author names,
author affiliations, and keywords, was automatically extracted. Moreover, we
constructed Layout-MetaBERT to extract metadata from academic journals with
varying layout formats. The experimental results with Layout-MetaBERT exhibited
robust performance (Macro-F1, 93.27%) in metadata extraction for unseen
journals with different layout formats.
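To make the layout-analysis step concrete, below is a minimal sketch (not the authors' code) of how text blocks and their bounding boxes can be pulled from a PDF with pdfminer.six, the library the abstract names for automatic layout analysis. The helper function and file name are hypothetical; the mapping of blocks to metadata labels is an assumption for illustration only.

```python
# Minimal sketch, assuming pdfminer.six: list each text block on every page
# together with its bounding box. Blocks like these (position plus text) are
# the kind of layout-separated units a classifier such as Layout-MetaBERT
# could then label as title, abstract, author, affiliation, or keywords.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer


def extract_text_blocks(pdf_path):
    """Return (page_number, bbox, text) for every text block in the PDF."""
    blocks = []
    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                # bbox is (x0, y0, x1, y1) in PDF points, origin at bottom-left.
                blocks.append((page_number, element.bbox, element.get_text().strip()))
    return blocks


if __name__ == "__main__":
    # "article.pdf" is a hypothetical input file used only for illustration.
    for page, bbox, text in extract_text_blocks("article.pdf")[:10]:
        print(page, [round(v, 1) for v in bbox], text[:60])
```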
Related papers
- Interactive Distillation of Large Single-Topic Corpora of Scientific Papers [1.2954493726326113]
A more robust but time-consuming approach is to build the dataset constructively, with a subject-matter expert handpicking documents.
Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature.
arXiv Detail & Related papers (2023-09-19T17:18:36Z)
- Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? [49.688233418425995]
Struc-Bench is a comprehensive benchmark featuring prominent Large Language Models (LLMs).
We propose two innovative metrics, P-Score (Prompting Score) and H-Score (Heuristical Score).
Our experiments show that applying our structure-aware fine-tuning to LLaMA-7B leads to substantial performance gains.
arXiv Detail & Related papers (2023-09-16T11:31:58Z)
- Enhancing Visually-Rich Document Understanding via Layout Structure Modeling [91.07963806829237]
We propose GraphLM, a novel document understanding model that injects layout knowledge into the model.
We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results.
arXiv Detail & Related papers (2023-08-15T13:53:52Z)
- Are Layout-Infused Language Models Robust to Layout Distribution Shifts? A Case Study with Scientific Documents [54.744701806413204]
Recent work has shown that infusing layout features into language models (LMs) improves processing of visually-rich documents such as scientific papers.
We test whether layout-infused LMs are robust to layout distribution shifts.
arXiv Detail & Related papers (2023-06-01T18:01:33Z)
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding [52.3895498789521]
We propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement.
We first rearrange input sequences in the serialization stage, then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents.
Experimental results show ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art results on key information extraction and document question answering.
arXiv Detail & Related papers (2022-10-12T12:59:24Z)
- A Multi-Format Transfer Learning Model for Event Argument Extraction via Variational Information Bottleneck [68.61583160269664]
Event argument extraction (EAE) aims to extract arguments with given roles from texts.
We propose a multi-format transfer learning model with variational information bottleneck.
We conduct extensive experiments on three benchmark datasets, and obtain new state-of-the-art performance on EAE.
arXiv Detail & Related papers (2022-08-27T13:52:01Z)
- Multimodal Approach for Metadata Extraction from German Scientific Publications [0.0]
We propose a multimodal deep learning approach for metadata extraction from scientific papers in the German language.
We consider multiple types of input data by combining natural language processing and image vision processing.
Our model was trained on a dataset of around 8,800 documents and obtains an overall F1-score of 0.923.
arXiv Detail & Related papers (2021-11-10T15:19:04Z)
- MexPub: Deep Transfer Learning for Metadata Extraction from German Publications [1.1549572298362785]
We present a method that extracts metadata from PDF documents with different layouts and styles by viewing the document as an image.
Our method achieved an average accuracy of around 90%, which validates its capability to accurately extract metadata from a variety of PDF documents.
arXiv Detail & Related papers (2021-06-04T09:43:48Z)
- Hierarchical Metadata-Aware Document Categorization under Weak Supervision [32.80303008934164]
We develop HiMeCat, an embedding-based generative framework for our task.
We propose a novel joint representation learning module that allows simultaneous modeling of category dependencies.
We introduce a data augmentation module that hierarchically synthesizes training documents to complement the original, small-scale training set.
arXiv Detail & Related papers (2020-10-26T13:07:56Z)
- A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal [10.553314461761968]
Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries.
This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters.
arXiv Detail & Related papers (2020-05-20T14:33:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.