Reliable End-to-End Material Information Extraction from the Literature with Source-Tracked Multi-Stage Large Language Models
- URL: http://arxiv.org/abs/2510.05142v1
- Date: Wed, 01 Oct 2025 22:03:28 GMT
- Title: Reliable End-to-End Material Information Extraction from the Literature with Source-Tracked Multi-Stage Large Language Models
- Authors: Xin Wang, Anshu Raj, Matthew Luebbe, Haiming Wen, Shuozhi Xu, Kun Lu
- Abstract summary: We propose a multi-stage information extraction pipeline powered by large language models. It captures 47 features spanning composition, processing, microstructure, and properties exclusively from experimentally reported materials. The pipeline integrates iterative extraction with source tracking to enhance both accuracy and reliability.
- Score: 3.3552980412055216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data-driven materials discovery requires large-scale experimental datasets, yet most of the information remains trapped in unstructured literature. Existing extraction efforts often focus on a limited set of features and have not addressed the integrated composition-processing-microstructure-property relationships essential for understanding materials behavior, thereby posing challenges for building comprehensive databases. To address this gap, we propose a multi-stage information extraction pipeline powered by large language models, which captures 47 features spanning composition, processing, microstructure, and properties exclusively from experimentally reported materials. The pipeline integrates iterative extraction with source tracking to enhance both accuracy and reliability. Evaluations at the feature level (independent attributes) and tuple level (interdependent features) yielded F1 scores around 0.96. Compared with single-pass extraction without source tracking, our approach improved F1 scores of microstructure category by 10.0% (feature level) and 13.7% (tuple level), and reduced missed materials from 49 to 13 out of 396 materials in 100 articles on precipitate-containing multi-principal element alloys (miss rate reduced from 12.4% to 3.3%). The pipeline enables scalable and efficient literature mining, producing databases with high precision, minimal omissions, and zero false positives. These datasets provide trustworthy inputs for machine learning and materials informatics, while the modular design generalizes to diverse material classes, enabling comprehensive materials information extraction.
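The abstract reports feature-level and tuple-level F1 scores around 0.96 and a miss rate reduced from 12.4% to 3.3%. A minimal sketch of how such feature-level scoring could be computed is shown below; this is not the authors' code, and the record shape and feature names are illustrative assumptions.

```python
# Illustrative feature-level precision/recall/F1 for extracted material records.
# A feature counts as correct only when its extracted value matches the gold
# value exactly; stricter or fuzzier matching rules are possible variants.

def feature_level_scores(predicted, gold):
    """Compare flat {feature: value} dicts for one material record."""
    tp = sum(1 for k, v in predicted.items() if gold.get(k) == v)
    fp = len(predicted) - tp  # extracted but wrong or spurious
    fn = sum(1 for k in gold if gold[k] != predicted.get(k))  # missed or wrong
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical record: two features match, one (hardness) disagrees.
pred = {"composition": "Al0.3CoCrFeNi", "phase": "FCC", "hardness_HV": "350"}
gold = {"composition": "Al0.3CoCrFeNi", "phase": "FCC", "hardness_HV": "362"}
p, r, f1 = feature_level_scores(pred, gold)  # each is 2/3 here
```

The reported miss rate follows the same counting logic at the material level: 13 missed out of 396 materials gives 13/396 ≈ 3.3%.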
Related papers
- ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature [0.2447206672789868]
ComProScanner is an autonomous multi-agent platform that facilitates the extraction, validation, classification, and visualisation of chemical compositions and properties. We evaluated our framework using 100 journal articles against 10 different LLMs, including both open-source and proprietary models. DeepSeek-V3-0324 outperformed all models with a significant overall accuracy of 0.82.
arXiv Detail & Related papers (2025-10-23T09:01:44Z) - Enhanced Multi-Tuple Extraction for Alloys: Integrating Pointer Networks and Augmented Attention [6.938202451113495]
We present a novel framework that combines a MatSciBERT-based extraction model with pointer networks and an allocation model. Our extraction experiments demonstrate impressive F1 scores of 0.947, 0.93, and 0.753 across datasets. These results highlight the model's capacity to deliver precise and structured information.
arXiv Detail & Related papers (2025-03-10T02:39:06Z) - DARWIN 1.5: Large Language Models as Materials Science Adapted Learners [46.7259033847682]
We propose DARWIN 1.5, the largest open-source large language model tailored for materials science. DARWIN eliminates the need for task-specific descriptors and enables a flexible, unified approach to material property prediction and discovery. Our approach integrates 6M material domain papers and 21 experimental datasets from 49,256 materials across modalities while enabling cross-task knowledge transfer.
arXiv Detail & Related papers (2024-12-16T16:51:27Z) - MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
Multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales. We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z) - Foundation Model for Composite Microstructures: Reconstruction, Stiffness, and Nonlinear Behavior Prediction [0.0]
We present the Material Masked Autoencoder (MMAE), a self-supervised Vision Transformer pretrained on a large corpus of short-fiber composite images. We demonstrate two key applications: (i) predicting homogenized stiffness components through fine-tuning on limited data, and (ii) inferring physically interpretable parameters by coupling MMAE with an interaction-based material network.
arXiv Detail & Related papers (2024-11-10T19:06:25Z) - Towards Enhancing Coherence in Extractive Summarization: Dataset and Experiments with LLMs [70.15262704746378]
We propose a systematically created human-annotated dataset consisting of coherent summaries for five publicly available datasets and natural language user feedback.
Preliminary experiments with Falcon-40B and Llama-2-13B show significant performance improvements (10% Rouge-L) in terms of producing coherent summaries.
arXiv Detail & Related papers (2024-07-05T20:25:04Z) - Dynamic In-context Learning with Conversational Models for Data Extraction and Materials Property Prediction [0.0]
PropertyExtractor is an open-source tool that blends zero-shot with few-shot in-context learning.
Our tests on material data demonstrate precision and recall that exceed 95% with an error rate of approximately 9%.
arXiv Detail & Related papers (2024-05-16T21:15:51Z) - Accelerating materials discovery for polymer solar cells: Data-driven insights enabled by natural language processing [5.527358421206627]
We present a simulation of various active learning strategies for the discovery of polymer solar cell donor/acceptor pairs.
Our approach demonstrates a potential reduction in discovery time by approximately 75%, equivalent to a 15-year acceleration in material innovation.
arXiv Detail & Related papers (2024-02-29T18:54:46Z) - Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction.
We reformulate the task to be entity-centric, enabling the use of diverse metrics.
We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
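The entry above introduces an Approximate Entity Set OverlaP metric for entity-centric extraction. A simplified, illustrative take on scoring a predicted entity set against a gold set is sketched below; it is an assumption for exposition, not the paper's exact metric definition: greedily pair each predicted entity with its best-matching gold entity by property overlap, then average pairwise Jaccard similarity over the larger set size.

```python
# Illustrative entity-set overlap score (hypothetical, simplified variant of
# an approximate-set-overlap metric). Entities are flat property dicts.

def property_jaccard(a, b):
    """Jaccard similarity over (property, value) pairs of two entities."""
    items_a, items_b = set(a.items()), set(b.items())
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 1.0

def entity_set_overlap(predicted, gold):
    """Greedily match predicted entities to gold entities, then average
    pairwise similarity over the larger set (penalizing misses/spurious)."""
    remaining = list(gold)
    total = 0.0
    for ent in predicted:
        if not remaining:
            break
        best = max(remaining, key=lambda g: property_jaccard(ent, g))
        total += property_jaccard(ent, best)
        remaining.remove(best)
    return total / max(len(predicted), len(gold), 1)

# Hypothetical example: one perfect match, one gold entity missed entirely.
pred = [{"name": "PVDF", "Tg": "-35C"}]
gold = [{"name": "PVDF", "Tg": "-35C"}, {"name": "PMMA", "Tg": "105C"}]
score = entity_set_overlap(pred, gold)  # 1.0 / 2 = 0.5
```

Dividing by the larger set size means both missed gold entities and spurious predictions lower the score, which is the property that makes such metrics suited to open-ended entity extraction.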
arXiv Detail & Related papers (2024-02-06T22:15:09Z) - A general-purpose material property data extraction pipeline from large polymer corpora using Natural Language Processing [4.688077134982731]
We used natural language processing methods to automatically extract material property data from the abstracts of polymer literature.
We obtained 300,000 material property records from 130,000 abstracts in 60 hours.
The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells.
arXiv Detail & Related papers (2022-09-27T03:47:03Z) - Understanding Programmatic Weak Supervision via Source-aware Influence Function [76.74549130841383]
Programmatic Weak Supervision (PWS) aggregates the source votes of multiple weak supervision sources into probabilistic training labels.
We build on the Influence Function (IF) to decompose the end model's training objective and then calculate the influence associated with each (data, source, class) tuple. These primitive influence scores can then be used to estimate the influence of individual PWS components, such as source votes, supervision sources, and training data.
arXiv Detail & Related papers (2022-05-25T15:57:24Z) - Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation [63.24594955429465]
Multi-source entity linkage is critical in high-impact applications such as data cleaning and user stitching.
AdaMEL is a deep transfer learning framework that learns generic high-level knowledge to perform multi-source entity linkage.
Our framework achieves state-of-the-art results with 8.21% improvement on average over methods based on supervised learning.
arXiv Detail & Related papers (2021-10-27T15:20:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.