MeXtract: Light-Weight Metadata Extraction from Scientific Papers
- URL: http://arxiv.org/abs/2510.06889v1
- Date: Wed, 08 Oct 2025 11:12:28 GMT
- Title: MeXtract: Light-Weight Metadata Extraction from Scientific Papers
- Authors: Zaid Alyafeai, Maged S. Al-Shaibani, Bernard Ghanem
- Abstract summary: We present MeXtract, a family of lightweight language models designed for metadata extraction from scientific papers. MeXtract achieves state-of-the-art performance on metadata extraction on the MOLE benchmark. We release all the code, datasets, and models openly for the research community.
- Score: 48.73595915402094
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Metadata plays a critical role in indexing, documenting, and analyzing scientific literature, yet extracting it accurately and efficiently remains a challenging task. Traditional approaches often rely on rule-based or task-specific models, which struggle to generalize across domains and schema variations. In this paper, we present MeXtract, a family of lightweight language models designed for metadata extraction from scientific papers. The models, ranging from 0.5B to 3B parameters, are built by fine-tuning Qwen 2.5 counterparts. In their size family, MeXtract achieves state-of-the-art performance on metadata extraction on the MOLE benchmark. To further support evaluation, we extend the MOLE benchmark to incorporate model-specific metadata, providing an out-of-domain challenging subset. Our experiments show that fine-tuning on a given schema not only yields high accuracy but also transfers effectively to unseen schemas, demonstrating the robustness and adaptability of our approach. We release all the code, datasets, and models openly for the research community.
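The abstract describes fine-tuning on a given metadata schema and validating extracted fields against it. A minimal sketch of the validation side of such a pipeline is shown below; the field names and schema are illustrative assumptions, not MeXtract's actual schema or code.

```python
# Hypothetical sketch: validating LLM-extracted metadata against a schema.
# The schema and field names below are illustrative only.

def validate_metadata(record, schema):
    """Return a list of schema violations for an extracted metadata record."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    return errors

# Example schema and a record an extraction model might produce.
schema = {"Name": str, "Year": int, "License": str, "Tasks": list}
extracted = {
    "Name": "ExampleCorpus",
    "Year": 2024,
    "License": "CC BY 4.0",
    "Tasks": ["NER", "QA"],
}
print(validate_metadata(extracted, schema))  # [] means the record is valid
```

A check like this can gate model outputs before they are indexed, which is one way the "robust validation mechanisms" mentioned for MOLE-style pipelines could be realized.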
Related papers
- LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence [61.46575527504109]
LimiX-16M and LimiX-2M treat structured data as a joint distribution over variables and missingness. We evaluate LimiX models across 11 large structured-data benchmarks with broad regimes of sample size, feature dimensionality, class number, categorical-to-numerical feature ratio, missingness, and sample-to-feature ratios.
arXiv Detail & Related papers (2025-09-03T17:39:08Z) - MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs [48.73595915402094]
MOLE is a framework that automatically extracts metadata attributes from scientific papers covering datasets of languages other than Arabic. Our methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
arXiv Detail & Related papers (2025-05-26T10:31:26Z) - Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the large language model (LLM) era is the generalization issue. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z) - A Generative AI-driven Metadata Modelling Approach [1.450405446885067]
This paper proposes a Generative AI-driven, Human-LLM collaboration-based metadata modelling approach that disentangles the entanglement inherent in each representation level, yielding a conceptually disentangled metadata model.
arXiv Detail & Related papers (2024-12-13T09:26:04Z) - LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable, evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and automatically generates visual question-answer pairs. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z) - Learning from models beyond fine-tuning [78.20895343699658]
Learn From Model (LFM) focuses on the research, modification, and design of foundation models (FM) based on the model interface. The study of LFM techniques can be broadly categorized into five major areas: model tuning, model distillation, model reuse, meta-learning, and model editing. This paper gives a comprehensive review of the current methods based on FM from the perspective of LFM.
arXiv Detail & Related papers (2023-10-12T10:20:36Z) - Adaptable Text Matching via Meta-Weight Regulator [14.619068650513917]
Meta-Weight Regulator (MWR) is a meta-learning approach that learns to assign weights to the source examples based on their relevance to the target loss.
MWR first trains the model on uniformly weighted source examples, and measures the efficacy of the model on the target examples via a loss function.
As MWR is model-agnostic, it can be applied to any backbone neural model.
arXiv Detail & Related papers (2022-04-27T02:28:40Z) - Multimodal Approach for Metadata Extraction from German Scientific Publications [0.0]
We propose a multimodal deep learning approach for metadata extraction from scientific papers in the German language.
We consider multiple types of input data by combining natural language processing and image vision processing.
Our model for this approach was trained on a dataset of around 8,800 documents and achieves an overall F1-score of 0.923.
arXiv Detail & Related papers (2021-11-10T15:19:04Z)