Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from
Literature with GPT-3
- URL: http://arxiv.org/abs/2304.13846v1
- Date: Wed, 26 Apr 2023 22:21:33 GMT
- Title: Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from
Literature with GPT-3
- Authors: Nicholas Walker, John Dagdelen, Kevin Cruse, Sanghoon Lee, Samuel
Gleason, Alexander Dunn, Gerbrand Ceder, A. Paul Alivisatos, Kristin A.
Persson, Anubhav Jain
- Abstract summary: We present a dataset of 11,644 entities extracted from 1,137 papers, resulting in 268 papers with at least one complete seed-mediated gold nanorod growth procedure and outcome for a total of 332 complete procedures.
- Score: 52.59930033705221
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although gold nanorods have been the subject of much research, the pathways
for controlling their shape and thereby their optical properties remain largely
heuristically understood. Although it is apparent that the simultaneous
presence of and interaction between various reagents during synthesis control
these properties, computational and experimental approaches for exploring the
synthesis space can be either intractable or too time-consuming in practice.
This motivates an alternative approach leveraging the wealth of synthesis
information already embedded in the body of scientific literature by developing
tools to extract relevant structured data in an automated, high-throughput
manner. To that end, we present an approach using the powerful GPT-3 language
model to extract structured multi-step seed-mediated growth procedures and
outcomes for gold nanorods from unstructured scientific text. GPT-3 prompt
completions are fine-tuned to predict synthesis templates in the form of JSON
documents from unstructured text input with an overall accuracy of $86\%$. The
performance is notable, considering the model is performing simultaneous entity
recognition and relation extraction. We present a dataset of 11,644 entities
extracted from 1,137 papers, resulting in 268 papers with at least one complete
seed-mediated gold nanorod growth procedure and outcome for a total of 332
complete procedures.
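As a rough illustration of how such an extraction pipeline can be driven, the sketch below sends a synthesis paragraph to a fine-tuned completion model and parses the returned JSON synthesis template. It assumes the legacy (pre-1.0) openai Python package; the model identifier, prompt suffix, stop token, and JSON schema are illustrative placeholders rather than the authors' exact templates.

```python
# Minimal sketch: querying a fine-tuned GPT-3 completion model to extract a
# structured seed-mediated growth procedure as JSON. Assumes the legacy
# (pre-1.0) openai package; model id, prompt suffix, stop token, and schema
# are illustrative placeholders, not the paper's exact choices.
import json
import openai

openai.api_key = "sk-..."  # placeholder

def extract_procedure(paragraph: str) -> dict:
    """Send a synthesis paragraph to the fine-tuned model and parse the
    JSON synthesis template it completes."""
    response = openai.Completion.create(
        model="davinci:ft-example-org-2023-01-01",  # hypothetical fine-tuned model
        prompt=paragraph + "\n\n###\n\n",           # suffix marking end of input
        max_tokens=512,
        temperature=0,                              # deterministic extraction
        stop=["\nEND"],                             # hypothetical completion terminator
    )
    return json.loads(response["choices"][0]["text"])

text = (
    "Gold seeds were prepared by reducing 0.25 mM HAuCl4 with ice-cold NaBH4 "
    "in 0.1 M CTAB; 12 uL of seed solution was then added to a growth solution "
    "containing HAuCl4, AgNO3, and ascorbic acid."
)
template = extract_procedure(text)
# Expected shape (illustrative only):
# {"seed_solution": {...}, "growth_solution": {...},
#  "outcome": {"aspect_ratio": ..., "lspr_nm": ...}}
print(json.dumps(template, indent=2))
```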
Related papers
- BAPULM: Binding Affinity Prediction using Language Models [7.136205674624813]
We introduce BAPULM, an innovative sequence-based framework that leverages latent representations of proteins via ProtT5-XL-U50 and of ligands via MolFormer.
Our approach was validated extensively on benchmark datasets, achieving sequential scoring power (R) values of 0.925 $\pm$ 0.043, 0.914 $\pm$ 0.004, and 0.8132 $\pm$ 0.001 on benchmark1k2101, Test2016_290, and CSAR-HiQ_36, respectively.
arXiv Detail & Related papers (2024-11-06T04:35:30Z)
- ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis [80.34000499166648]
We propose a Graph-based Sampling strategy to sample more relevant tool combinations, and a Planned-generation strategy to create plans that guide the synthesis of coherent dialogues.
We apply SFT on LLaMA-3.1-8B using 8,000 synthetic dialogues generated with ToolFlow.
Results show that the model achieves tool-calling performance comparable to or even surpassing GPT-4, while maintaining strong general capabilities.
arXiv Detail & Related papers (2024-10-24T05:45:04Z)
- SynthFormer: Equivariant Pharmacophore-based Generation of Molecules for Ligand-Based Drug Design [1.3927943269211591]
This paper addresses the gap between in silico generative approaches and practical in vitro methodologies.
We introduce SynthFormer, a novel ML model that utilizes a 3D equivariant encoder for pharmacophores to generate fully synthesizable molecules.
Our contributions include a new methodology for efficient chemical space exploration using 3D information, a novel architecture called SynthFormer for translating 3D pharmacophore representations into molecules, and a meaningful embedding space that organizes reagents for drug discovery optimization.
arXiv Detail & Related papers (2024-10-03T17:38:46Z)
- BatGPT-Chem: A Foundation Large Model For Retrosynthesis Prediction [65.93303145891628]
BatGPT-Chem is a large language model with 15 billion parameters, tailored for enhanced retrosynthesis prediction.
Our model captures a broad spectrum of chemical knowledge, enabling precise prediction of reaction conditions.
This development empowers chemists to adeptly address novel compounds, potentially expediting the innovation cycle in drug manufacturing and materials science.
arXiv Detail & Related papers (2024-08-19T05:17:40Z)
- Compositional Deep Probabilistic Models of DNA Encoded Libraries [6.206196935093064]
We introduce a compositional deep probabilistic model of DEL data, DEL-Compose, which decomposes molecular representations into their mono-synthon, di-synthon, and tri-synthon building blocks.
Our model demonstrates strong performance compared to count baselines, enriches the correct pharmacophores, and offers valuable insights via its intrinsic interpretable structure.
arXiv Detail & Related papers (2023-10-20T19:04:28Z)
- Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data.
We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that improves the performance of a small model by iteratively shrinking the distribution gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
- Structured information extraction from complex scientific text with fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts.
This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
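To give a concrete sense of what roughly 500 prompt-completion training pairs look like in this style of fine-tuning, the snippet below writes annotated (text, record) examples to a JSONL file in the legacy prompt/completion format; the field names, schema, and separators are illustrative assumptions, not the paper's own templates.

```python
# Minimal sketch of preparing prompt-completion pairs for fine-tuning a
# completion-style LLM on joint entity and relation extraction. Schema,
# separators, and file name are illustrative assumptions.
import json

# Hypothetical annotated example: unstructured text paired with the
# structured record the model should learn to emit.
examples = [
    {
        "text": "The seed solution was prepared by adding 0.6 mL of 10 mM "
                "NaBH4 to 10 mL of 0.25 mM HAuCl4 in 0.1 M CTAB.",
        "record": {
            "step": "seed_preparation",
            "reagents": [
                {"name": "NaBH4", "volume": "0.6 mL", "concentration": "10 mM"},
                {"name": "HAuCl4", "volume": "10 mL", "concentration": "0.25 mM"},
                {"name": "CTAB", "concentration": "0.1 M"},
            ],
        },
    },
]

with open("finetune_data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps({
            # A fixed suffix separates the input text from the expected completion.
            "prompt": ex["text"] + "\n\n###\n\n",
            # The completion is the JSON record plus a stop sequence the model
            # learns to emit when the record is complete.
            "completion": " " + json.dumps(ex["record"]) + "\nEND",
        }) + "\n")
```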
arXiv Detail & Related papers (2022-12-10T07:51:52Z)
- PcMSP: A Dataset for Scientific Action Graphs Extraction from Polycrystalline Materials Synthesis Procedure Text [1.9573380763700712]
This dataset contains the synthesis sentences extracted from the experimental paragraphs, together with the entity mentions and intra-sentence relations.
A two-step human annotation process and an inter-annotator agreement study ensure the high quality of the PcMSP corpus.
We introduce four natural language processing tasks: sentence classification, named entity recognition, relation classification, and joint extraction of entities and relations.
arXiv Detail & Related papers (2022-10-22T09:43:54Z)
- Machine-Learning-Optimized Perovskite Nanoplatelet Synthesis [55.41644538483948]
We develop an algorithm to improve the quality of CsPbBr3 nanoplatelets (NPLs) using only 200 total syntheses.
The algorithm can predict the resulting PL emission maxima of the NPL dispersions based on the precursor ratios.
arXiv Detail & Related papers (2022-10-18T11:54:11Z)
- Annotating and Extracting Synthesis Process of All-Solid-State Batteries from Scientific Literature [10.443499579567069]
We present a novel corpus of the synthesis process for all-solid-state batteries and an automated machine reading system.
We define the representation of the synthesis processes using flow graphs, and create a corpus from the experimental sections of 243 papers.
The automated machine-reading system combines a deep-learning-based sequence tagger with a simple rule-based relation extractor.
arXiv Detail & Related papers (2020-02-18T02:30:03Z)