A general-purpose material property data extraction pipeline from large
polymer corpora using Natural Language Processing
- URL: http://arxiv.org/abs/2209.13136v1
- Date: Tue, 27 Sep 2022 03:47:03 GMT
- Title: A general-purpose material property data extraction pipeline from large
polymer corpora using Natural Language Processing
- Authors: Pranav Shetty, Arunkumar Chitteth Rajan, Christopher Kuenneth,
Sonkakshi Gupta, Lakshmi Prerana Panchumarti, Lauren Holm, Chao Zhang, and
Rampi Ramprasad
- Abstract summary: We used natural language processing methods to automatically extract material property data from the abstracts of polymer literature.
We obtained 300,000 material property records from 130,000 abstracts in 60 hours.
The extracted data was analyzed for a diverse range of applications such as fuel cells, supercapacitors, and polymer solar cells.
- Score: 4.688077134982731
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ever-increasing number of materials science articles makes it hard to
infer chemistry-structure-property relations from published literature. We used
natural language processing (NLP) methods to automatically extract material
property data from the abstracts of polymer literature. As a component of our
pipeline, we trained MaterialsBERT, a language model, using 2.4 million
materials science abstracts, which outperforms other baseline models in three
out of five named entity recognition datasets when used as the encoder for
text. Using this pipeline, we obtained ~300,000 material property records from
~130,000 abstracts in 60 hours. The extracted data was analyzed for a diverse
range of applications such as fuel cells, supercapacitors, and polymer solar
cells to recover non-trivial insights. The data extracted through our pipeline
is made available through a web platform at https://polymerscholar.org which
can be used to locate material property data recorded in abstracts
conveniently. This work demonstrates the feasibility of an automatic pipeline
that starts from published literature and ends with a complete set of extracted
material property information.
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z) - CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets.
We use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents.
We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks.
arXiv Detail & Related papers (2024-09-03T17:54:40Z) - Dynamic In-context Learning with Conversational Models for Data Extraction and Materials Property Prediction [0.0]
PropertyExtractor is an open-source tool that blends zero-shot with few-shot in-context learning.
Our tests on material data demonstrate precision and recall that exceed 95% with an error rate of approximately 9%.
arXiv Detail & Related papers (2024-05-16T21:15:51Z) - Accelerating materials discovery for polymer solar cells: Data-driven insights enabled by natural language processing [5.527358421206627]
We present a simulation of various active learning strategies for the discovery of polymer solar cell donor/acceptor pairs.
Our approach demonstrates a potential reduction in discovery time by approximately 75 %, equivalent to a 15 year acceleration in material innovation.
arXiv Detail & Related papers (2024-02-29T18:54:46Z) - Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction.
We reformulate the task to be entity-centric, enabling the use of diverse metrics.
We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
arXiv Detail & Related papers (2024-02-06T22:15:09Z) - Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z) - 1.5 million materials narratives generated by chatbots [25.125848842769464]
We have generated a dataset of 1,494,017 natural language-material paragraphs based on combined OQMD, Materials Project, JARVIS, COD and AFLOW2 databases.
The generated text narratives were then polled and scored by both human experts and ChatGPT-4, based on three rubrics: technical accuracy, language and structure, and relevance and depth of content.
arXiv Detail & Related papers (2023-08-25T22:00:53Z) - Application of Transformers based methods in Electronic Medical Records:
A Systematic Literature Review [77.34726150561087]
This work presents a systematic literature review of state-of-the-art advances using transformer-based methods on electronic medical records (EMRs) in different NLP tasks.
arXiv Detail & Related papers (2023-04-05T22:19:42Z) - Structured information extraction from complex scientific text with
fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts.
This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z) - Analyzing Research Trends in Inorganic Materials Literature Using NLP [8.645705008293838]
This study proposes a large-scale natural language processing (NLP) pipeline for extracting material names and properties from materials science literature.
We build a corpus containing 836 annotated paragraphs extracted from 301 papers for training a named entity recognition (NER) model.
Experimental results demonstrate the utility of this NER model; it achieves successful extraction with a micro-F1 score of 78.1%.
arXiv Detail & Related papers (2021-06-27T06:29:10Z) - MatScIE: An automated tool for the generation of databases of methods
and parameters used in the computational materials science literature [5.217605474243695]
MatScIE can extract relevant information from material science literature and make a structured database.
Users can upload published articles and view/download the information obtained from this tool.
arXiv Detail & Related papers (2020-09-15T01:52:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.