Related papers: Comparison of Feature Learning Methods for Metadata Extraction from PDF Scholarly Documents

Comparison of Feature Learning Methods for Metadata Extraction from PDF Scholarly Documents

URL: http://arxiv.org/abs/2501.05082v1
Date: Thu, 09 Jan 2025 09:03:43 GMT
Title: Comparison of Feature Learning Methods for Metadata Extraction from PDF Scholarly Documents
Authors: Zeyd Boukhers, Cong Yang,
Abstract summary: This study evaluates various feature learning and prediction methods, including natural language processing (NLP), computer vision (CV), and multimodal approaches, for extracting metadata from documents with high template variance.<n>We aim to improve the accessibility of scientific documents and facilitate their wider use.
Score: 8.516310581591426
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The availability of metadata for scientific documents is pivotal in propelling scientific knowledge forward and for adhering to the FAIR principles (i.e. Findability, Accessibility, Interoperability, and Reusability) of research findings. However, the lack of sufficient metadata in published documents, particularly those from smaller and mid-sized publishers, hinders their accessibility. This issue is widespread in some disciplines, such as the German Social Sciences, where publications often employ diverse templates. To address this challenge, our study evaluates various feature learning and prediction methods, including natural language processing (NLP), computer vision (CV), and multimodal approaches, for extracting metadata from documents with high template variance. We aim to improve the accessibility of scientific documents and facilitate their wider use. To support our comparison of these methods, we provide comprehensive experimental results, analyzing their accuracy and efficiency in extracting metadata. Additionally, we provide valuable insights into the strengths and weaknesses of various feature learning and prediction methods, which can guide future research in this field.

Related papers

MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs [54.5729817345543]
MOLE is a framework that automatically extracts metadata attributes from scientific papers covering datasets of languages other than Arabic.<n>Our methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
arXiv Detail & Related papers (2025-05-26T10:31:26Z)
Personalized Multimodal Large Language Models: A Survey [127.9521218125761]
Multimodal Large Language Models (MLLMs) have become increasingly important due to their state-of-the-art performance and ability to integrate multiple data modalities. This paper presents a comprehensive survey on personalized multimodal large language models, focusing on their architecture, training methods, and applications.
arXiv Detail & Related papers (2024-12-03T03:59:03Z)
A Survey of Small Language Models [104.80308007044634]
Small Language Models (SLMs) have become increasingly important due to their efficiency and performance to perform various language tasks with minimal computational resources. We present a comprehensive survey on SLMs, focusing on their architectures, training techniques, and model compression techniques.
arXiv Detail & Related papers (2024-10-25T23:52:28Z)
Multimodal Methods for Analyzing Learning and Training Environments: A Systematic Literature Review [3.0712840129998513]
This literature review proposes a taxonomy and framework that encapsulates recent methodological advances in this field. We introduce a novel data fusion category -- mid fusion -- and a graph-based technique for refining literature reviews, termed citation graph pruning. There remains a need for further research to bridge the divide between multimodal learning and training studies and foundational AI research.
arXiv Detail & Related papers (2024-08-22T22:42:23Z)
MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations. We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models. The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
SyROCCo: Enhancing Systematic Reviews using Machine Learning [6.805429133535976]
This paper explores the use of machine learning techniques to help navigate the systematic review process. The application of ML techniques to subsequent stages of a review, such as data extraction and evidence mapping, is in its infancy.
arXiv Detail & Related papers (2024-06-24T11:04:43Z)
A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset. Deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive. Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
Making Metadata More FAIR Using Large Language Models [2.61630828688114]
This work presents a Natural Language Processing (NLP) informed application, called FAIRMetaText, that compares metadata. Specifically, FAIRMetaText analyzes the natural language descriptions of metadata and provides a mathematical similarity measure between two terms. This software can drastically reduce the human effort in sifting through various natural language metadata while employing several experimental datasets on the same topic.
arXiv Detail & Related papers (2023-07-24T19:14:38Z)
Application of Transformers based methods in Electronic Medical Records: A Systematic Literature Review [77.34726150561087]
This work presents a systematic literature review of state-of-the-art advances using transformer-based methods on electronic medical records (EMRs) in different NLP tasks.
arXiv Detail & Related papers (2023-04-05T22:19:42Z)
The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces [54.2590226904332]
We describe the Semantic Reader Project, a effort across multiple institutions to explore automatic creation of dynamic reading interfaces for research papers. Ten prototype interfaces have been developed and more than 300 participants and real-world users have shown improved reading experiences. We structure this paper around challenges scholars and the public face when reading research papers.
arXiv Detail & Related papers (2023-03-25T02:47:09Z)
Multimodal Approach for Metadata Extraction from German Scientific Publications [0.0]
We propose a multimodal deep learning approach for metadata extraction from scientific papers in the German language. We consider multiple types of input data by combining natural language processing and image vision processing. Our model for this approach was trained on a dataset consisting of around 8800 documents and is able to obtain an overall F1-score of 0.923.
arXiv Detail & Related papers (2021-11-10T15:19:04Z)
CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers. We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
Generating Knowledge Graphs by Employing Natural Language Processing and Machine Learning Techniques within the Scholarly Domain [1.9004296236396943]
We present a new architecture that takes advantage of Natural Language Processing and Machine Learning methods for extracting entities and relationships from research publications. Within this research work, we i) tackle the challenge of knowledge extraction by employing several state-of-the-art Natural Language Processing and Text Mining tools. We generated a scientific knowledge graph including 109,105 triples, extracted from 26,827 abstracts of papers within the Semantic Web domain.
arXiv Detail & Related papers (2020-10-28T08:31:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.