Large Language Models as Master Key: Unlocking the Secrets of Materials
Science with GPT
- URL: http://arxiv.org/abs/2304.02213v5
- Date: Wed, 12 Apr 2023 14:06:02 GMT
- Title: Large Language Models as Master Key: Unlocking the Secrets of Materials
Science with GPT
- Authors: Tong Xie, Yuwei Wan, Wei Huang, Yufei Zhou, Yixuan Liu, Qingyuan
Linghu, Shaozhou Wang, Chunyu Kit, Clara Grazian, Wenjie Zhang and Bram Hoex
- Abstract summary: This article presents a new natural language processing (NLP) task called structured information inference (SII) to address the complexities of information extraction at the device level in materials science.
We accomplished this task by tuning GPT-3 on an existing perovskite solar cell FAIR dataset with 91.8% F1-score and extended the dataset with data published since its release.
We also designed experiments to predict the electrical performance of solar cells and design materials or devices with targeted parameters using large language models (LLMs).
- Score: 9.33544942080883
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Data has taken on growing significance in the exploration of
cutting-edge materials, and a number of datasets have been generated either by
hand or by automated approaches. However, the materials science field struggles to
effectively utilize the abundance of data, especially in applied disciplines
where materials are evaluated based on device performance rather than their
properties. This article presents a new natural language processing (NLP) task
called structured information inference (SII) to address the complexities of
information extraction at the device level in materials science. We
accomplished this task by tuning GPT-3 on an existing perovskite solar cell
FAIR (Findable, Accessible, Interoperable, Reusable) dataset with 91.8%
F1-score and extended the dataset with data published since its release. The
produced data is formatted and normalized, enabling its direct utilization as
input in subsequent data analysis. This feature empowers materials scientists
to develop models by selecting high-quality review articles within their
domain. Additionally, we designed experiments to predict the electrical
performance of solar cells and design materials or devices with targeted
parameters using large language models (LLMs). Our results demonstrate
comparable performance to traditional machine learning methods without feature
selection, highlighting the potential of LLMs to acquire scientific knowledge
and design new materials akin to materials scientists.
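The SII task described in the abstract pairs free-text device descriptions with normalized structured records, then tunes a model (GPT-3 in the paper) on those pairs. A minimal sketch of preparing such training data in the prompt/completion JSONL format used by common fine-tuning APIs follows; the field names (`absorber`, `pce_percent`, `voc_v`) are illustrative assumptions, not the paper's actual schema.

```python
import json

# Sketch of structured information inference (SII) training data: each
# example pairs a free-text device description with the normalized record
# the tuned model should learn to emit. Field names are assumptions for
# illustration; the paper's perovskite solar cell schema may differ.

def make_sii_example(passage: str, record: dict) -> dict:
    """Format one (text, structured record) pair as a prompt/completion example."""
    return {
        "prompt": f"Extract the device parameters as JSON.\nText: {passage}\nJSON:",
        "completion": " " + json.dumps(record, sort_keys=True),
    }

examples = [
    make_sii_example(
        "The champion FAPbI3 cell achieved a PCE of 23.1% with a Voc of 1.12 V.",
        {"absorber": "FAPbI3", "pce_percent": 23.1, "voc_v": 1.12},
    ),
]

# Serialize to JSONL, the line-per-record format fine-tuning APIs expect.
jsonl = "\n".join(json.dumps(e) for e in examples)
print(jsonl)
```

Because the completion is normalized JSON, the tuned model's outputs can feed directly into downstream data analysis, which is the feature the abstract highlights.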
Related papers
- Foundation Model for Composite Materials and Microstructural Analysis [49.1574468325115]
We present a foundation model specifically designed for composite materials.
Our model is pre-trained on a dataset of short-fiber composites to learn robust latent features.
During transfer learning, the MMAE accurately predicts homogenized stiffness, with an R2 score reaching as high as 0.959 and consistently exceeding 0.91, even when trained on limited data.
arXiv Detail & Related papers (2024-11-10T19:06:25Z) - Synthetic Data Generation with Large Language Models for Personalized Community Question Answering [47.300506002171275]
We build Sy-SE-PQA based on an existing dataset, SE-PQA, which consists of questions and answers posted on the popular StackExchange communities.
Our findings suggest that LLMs have high potential in generating data tailored to users' needs.
The synthetic data can replace human-written training data, even though the generated data may contain incorrect information.
arXiv Detail & Related papers (2024-10-29T16:19:08Z) - From Text to Insight: Large Language Models for Materials Science Data Extraction [4.08853418443192]
The vast majority of materials science knowledge exists in unstructured natural language.
Structured data is crucial for innovative and systematic materials design.
The advent of large language models (LLMs) represents a significant shift.
arXiv Detail & Related papers (2024-07-23T22:23:47Z) - MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science.
Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z) - SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z) - Dynamic In-context Learning with Conversational Models for Data Extraction and Materials Property Prediction [0.0]
PropertyExtractor is an open-source tool that blends zero-shot with few-shot in-context learning.
Our tests on material data demonstrate precision and recall that exceed 95% with an error rate of approximately 9%.
arXiv Detail & Related papers (2024-05-16T21:15:51Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that imparts the reasoning skills necessary for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - Mining experimental data from Materials Science literature with Large Language Models: an evaluation study [1.9849264945671101]
This study is dedicated to assessing the capabilities of large language models (LLMs) in extracting structured information from scientific documents in materials science.
We focus on two critical tasks of information extraction: (i) a named entity recognition (NER) of studied materials and physical properties and (ii) a relation extraction (RE) between these entities.
The performance of LLMs in executing these tasks is benchmarked against traditional models based on the BERT architecture and rule-based approaches (baselines).
arXiv Detail & Related papers (2024-01-19T23:00:31Z) - Agent-based Learning of Materials Datasets from Scientific Literature [0.0]
We develop a chemist AI agent, powered by large language models (LLMs), to create structured datasets from natural language text.
Our chemist AI agent, Eunomia, can plan and execute actions by leveraging the existing knowledge from decades of scientific research articles.
arXiv Detail & Related papers (2023-12-18T20:29:58Z) - Accelerated materials language processing enabled by GPT [5.518792725397679]
We develop generative pretrained transformer (GPT)-enabled pipelines for materials language processing.
First, we develop a GPT-enabled document classification method for screening relevant documents.
Second, for the NER task, we design entity-centric prompts; few-shot learning with them improves performance.
Finally, we develop a GPT-enabled extractive QA model, which provides improved performance and shows the possibility of automatically correcting annotations.
arXiv Detail & Related papers (2023-08-18T07:31:13Z)
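Several entries above describe few-shot, entity-centric prompting for materials NER, where a handful of labeled demonstrations are prepended to the query sentence. A minimal sketch of assembling such a prompt follows; the demonstration sentences and the single `material` entity type are invented for illustration and are not drawn from any of the papers' datasets.

```python
# Sketch of entity-centric few-shot prompting for materials NER.
# The demonstrations below are invented examples, not real annotations.

FEW_SHOT = [
    ("MAPbI3 films were annealed at 100 C.", {"material": ["MAPbI3"]}),
    ("The TiO2 layer improved electron transport.", {"material": ["TiO2"]}),
]

def build_ner_prompt(sentence: str) -> str:
    """Prepend labeled demonstrations to the query sentence."""
    lines = ["Tag the material entities in each sentence as JSON."]
    for text, labels in FEW_SHOT:
        lines.append(f"Sentence: {text}")
        lines.append(f"Entities: {labels}")
    lines.append(f"Sentence: {sentence}")
    lines.append("Entities:")  # the model completes this line
    return "\n".join(lines)

prompt = build_ner_prompt("SnO2 was deposited by spin coating.")
print(prompt)
```

The completed prompt would be sent to an LLM, whose continuation of the final "Entities:" line is then parsed as the extraction result.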
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.