LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature
- URL: http://arxiv.org/abs/2510.26824v1
- Date: Tue, 28 Oct 2025 17:58:18 GMT
- Title: LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature
- Authors: Magdalena Lederbauer, Siddharth Betala, Xiyao Li, Ayush Jain, Amine Sehaba, Georgia Channing, Grégoire Germain, Anamaria Leonescu, Faris Flaifil, Alfonso Amayuelas, Alexandre Nozadze, Stefan P. Schmid, Mohd Zaki, Sudheesh Kumar Ethirajan, Elton Pan, Mathilde Franckel, Alexandre Duval, N. M. Anoop Krishnan, Samuel P. Gleason,
- Abstract summary: We propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data. We curated 81k open-access papers, yielding LeMat-Synth (v 1.0): a dataset containing synthesis procedures spanning 35 synthesis methods and 16 material classes. We release a modular, open-source library designed to support community-driven extension to new corpora and synthesis domains.
- Score: 60.879220305044726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of synthesis procedures remains a fundamental challenge in materials discovery, with procedural knowledge scattered across decades of scientific literature in unstructured formats that are challenging for systematic analysis. In this paper, we propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data from materials science publications, covering text and figures. We curated 81k open-access papers, yielding LeMat-Synth (v 1.0): a dataset containing synthesis procedures spanning 35 synthesis methods and 16 material classes, structured according to an ontology specific to materials science. The extraction quality is rigorously evaluated on a subset of 2.5k synthesis procedures through a combination of expert annotations and a scalable LLM-as-a-judge framework. Beyond the dataset, we release a modular, open-source software library designed to support community-driven extension to new corpora and synthesis domains. Altogether, this work provides an extensible infrastructure to transform unstructured literature into machine-readable information. This lays the groundwork for predictive modeling of synthesis procedures as well as modeling synthesis–structure–property relationships.
Related papers
- OmniStruct: Universal Text-to-Structure Generation across Diverse Schemas [57.49565459553627]
We introduce OmniStruct, a benchmark for assessing Large Language Models' capabilities on text-to-structure tasks. We collect high-quality training data via synthetic task generation to facilitate the development of efficient text-to-structure models. Our experiments demonstrate the possibility of fine-tuning much smaller models on synthetic data into universal structured generation models.
arXiv Detail & Related papers (2025-11-23T08:18:12Z)
- ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature [0.2447206672789868]
ComProScanner is an autonomous multi-agent platform that facilitates the extraction, validation, classification, and visualisation of chemical compositions and properties. We evaluated our framework using 100 journal articles against 10 different LLMs, including both open-source and proprietary models. DeepSeek-V3-0324 outperformed all models with a significant overall accuracy of 0.82.
arXiv Detail & Related papers (2025-10-23T09:01:44Z)
- Language Native Lightly Structured Databases for Large Language Model Driven Composite Materials Research [6.31777560888658]
Preparation procedures of materials are often embedded narratively in experimental protocols, research articles, patents, and laboratory notes. We reformulate this challenge into a text-reasoning problem through a framework centered on a text-first, lightly structured materials database. We show how language-native data combined with LLM-based reasoning can significantly accelerate practical material preparation.
arXiv Detail & Related papers (2025-09-07T15:15:55Z)
- MatPROV: A Provenance Graph Dataset of Material Synthesis Extracted from Scientific Literature [1.171928204630468]
We present MatPROV, a dataset of PROV-DM-compliant synthesis procedures extracted from scientific literature. MatPROV captures structural complexities and causal relationships among materials, operations, and conditions through visually intuitive directed graphs.
arXiv Detail & Related papers (2025-09-01T00:47:27Z)
- Towards Fully-Automated Materials Discovery via Large-Scale Synthesis Dataset and Expert-Level LLM-as-a-Judge [6.500470477634259]
Our work aims to support the materials science community by providing a practical, data-driven resource. We have curated a comprehensive dataset of 17K expert-verified synthesis recipes from open-access literature. AlchemicalBench offers an end-to-end framework that supports research in large language models applied to synthesis prediction.
arXiv Detail & Related papers (2025-02-23T06:16:23Z)
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- Extracting Structured Seed-Mediated Gold Nanorod Growth Procedures from Literature with GPT-3 [52.59930033705221]
We present a dataset of 11,644 entities extracted from 1,137 papers, resulting in 268 papers with at least one complete seed-mediated gold nanorod growth procedure and outcome for a total of 332 complete procedures.
arXiv Detail & Related papers (2023-04-26T22:21:33Z)
- Structured information extraction from complex scientific text with fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts.
This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z)
- ULSA: Unified Language of Synthesis Actions for Representation of Synthesis Protocols [2.436060325115753]
We propose the first Unified Language of Synthesis Actions (ULSA) for describing synthesis procedures.
We created a dataset of 3,040 synthesis procedures annotated by domain experts according to the proposed ULSA scheme.
arXiv Detail & Related papers (2022-01-23T17:44:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.