Structured Extraction of Process Structure Properties Relationships in Materials Science
- URL: http://arxiv.org/abs/2504.03979v1
- Date: Fri, 04 Apr 2025 22:44:02 GMT
- Title: Structured Extraction of Process Structure Properties Relationships in Materials Science
- Authors: Amit K Verma, Zhisong Zhang, Junwon Seo, Robin Kuo, Runbo Jiang, Emma Strubell, Anthony D Rollett,
- Abstract summary: We introduce a novel annotation schema designed to extract generic process-structure-properties relationships from scientific literature.<n>We demonstrate the utility of this approach using a dataset of 128 abstracts, with annotations drawn from two distinct domains.<n>Our results indicate that fine-tuning LLMs can significantly improve entity extraction performance over the BERT-CRF baseline on Domain I.
- Score: 10.10021626682367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the advent of large language models (LLMs), the vast unstructured text within millions of academic papers is increasingly accessible for materials discovery, although significant challenges remain. While LLMs offer promising few- and zero-shot learning capabilities, particularly valuable in the materials domain where expert annotations are scarce, general-purpose LLMs often fail to address key materials-specific queries without further adaptation. To bridge this gap, fine-tuning LLMs on human-labeled data is essential for effective structured knowledge extraction. In this study, we introduce a novel annotation schema designed to extract generic process-structure-properties relationships from scientific literature. We demonstrate the utility of this approach using a dataset of 128 abstracts, with annotations drawn from two distinct domains: high-temperature materials (Domain I) and uncertainty quantification in simulating materials microstructure (Domain II). Initially, we developed a conditional random field (CRF) model based on MatBERT, a domain-specific BERT variant, and evaluated its performance on Domain I. Subsequently, we compared this model with a fine-tuned LLM (GPT-4o from OpenAI) under identical conditions. Our results indicate that fine-tuning LLMs can significantly improve entity extraction performance over the BERT-CRF baseline on Domain I. However, when additional examples from Domain II were incorporated, the performance of the BERT-CRF model became comparable to that of the GPT-4o model. These findings underscore the potential of our schema for structured knowledge extraction and highlight the complementary strengths of both modeling approaches.
Related papers
- Enhancing Domain-Specific Encoder Models with LLM-Generated Data: How to Leverage Ontologies, and How to Do Without Them [9.952432291248954]
We investigate the use of LLM-generated data for continual pretraining of encoder models in domains with limited data.
We compile a benchmark specifically designed for assessing embedding model performance in invasion biology.
Our results demonstrate that this approach achieves a fully automated pipeline for enhancing domain-specific understanding of small encoder models.
arXiv Detail & Related papers (2025-03-27T21:51:24Z) - Causal Discovery from Data Assisted by Large Language Models [50.193740129296245]
It is essential to integrate experimental data with prior domain knowledge for knowledge driven discovery.<n>Here we demonstrate this approach by combining high-resolution scanning transmission electron microscopy (STEM) data with insights derived from large language models (LLMs)<n>By fine-tuning ChatGPT on domain-specific literature, we construct adjacency matrices for Directed Acyclic Graphs (DAGs) that map the causal relationships between structural, chemical, and polarization degrees of freedom in Sm-doped BiFeO3 (SmBFO)
arXiv Detail & Related papers (2025-03-18T02:14:49Z) - Clear Minds Think Alike: What Makes LLM Fine-tuning Robust? A Study of Token Perplexity [61.48338027901318]
We show that fine-tuning with LLM-generated data improves target task performance and reduces out-of-domain degradation.
This is the first mechanistic explanation for the superior OOD robustness conferred by LLM-generated training data.
arXiv Detail & Related papers (2025-01-24T08:18:56Z) - Aggregated Knowledge Model: Enhancing Domain-Specific QA with Fine-Tuned and Retrieval-Augmented Generation Models [0.0]
This paper introduces a novel approach to enhancing closed-domain Question Answering (QA) systems.
It focuses on the specific needs of the Lawrence Berkeley National Laboratory (LBL) Science Information Technology (ScienceIT) domain.
arXiv Detail & Related papers (2024-10-24T00:49:46Z) - Exploring Language Model Generalization in Low-Resource Extractive QA [57.14068405860034]
We investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift.<n>We devise a series of experiments to explain the performance gap empirically.
arXiv Detail & Related papers (2024-09-27T05:06:43Z) - A Few-Shot Approach for Relation Extraction Domain Adaptation using Large Language Models [1.3927943269211591]
This paper experiments with leveraging in-context learning capabilities of Large Language Models to perform data annotation.
We show that by using a few-shot learning strategy with structured prompts and only minimal expert annotation the presented approach can potentially support domain adaptation of a science KG generation model.
arXiv Detail & Related papers (2024-08-05T11:06:36Z) - ProgGen: Generating Named Entity Recognition Datasets Step-by-step with Self-Reflexive Large Language Models [25.68491572293656]
Large Language Models fall short in structured knowledge extraction tasks such as named entity recognition.
This paper explores an innovative, cost-efficient strategy to harness LLMs with modest NER capabilities for producing superior NER datasets.
arXiv Detail & Related papers (2024-03-17T06:12:43Z) - Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction.
We reformulate the task to be entity-centric, enabling the use of diverse metrics.
We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
arXiv Detail & Related papers (2024-02-06T22:15:09Z) - Mining experimental data from Materials Science literature with Large Language Models: an evaluation study [1.9849264945671101]
This study is dedicated to assessing the capabilities of large language models (LLMs) in extracting structured information from scientific documents in materials science.
We focus on two critical tasks of information extraction: (i) a named entity recognition (NER) of studied materials and physical properties and (ii) a relation extraction (RE) between these entities.
The performance of LLMs in executing these tasks is benchmarked against traditional models based on the BERT architecture and rule-based approaches (baseline)
arXiv Detail & Related papers (2024-01-19T23:00:31Z) - EcomGPT-CT: Continual Pre-training of E-commerce Large Language Models
with Semi-structured Data [67.8302955948861]
Large Language Models (LLMs) pre-trained on massive corpora have exhibited remarkable performance on various NLP tasks.
Applying these models to specific domains still poses significant challenges, such as lack of domain knowledge.
We focus on domain-specific continual pre-training of LLMs using E-commerce domain as an exemplar.
arXiv Detail & Related papers (2023-12-25T11:31:47Z) - Improving Open Information Extraction with Large Language Models: A
Study on Demonstration Uncertainty [52.72790059506241]
Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text.
Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.