ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature
- URL: http://arxiv.org/abs/2510.20362v1
- Date: Thu, 23 Oct 2025 09:01:44 GMT
- Title: ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature
- Authors: Aritra Roy, Enrico Grisan, John Buckeridge, Chiara Gattinoni,
- Abstract summary: ComProScanner is an autonomous multi-agent platform that facilitates the extraction, validation, classification, and visualisation of chemical compositions and properties.<n>We evaluated our framework using 100 journal articles against 10 different LLMs, including both open-source and proprietary models.<n>DeepSeek-V3-0324 outperformed all models with a significant overall accuracy of 0.82.
- Score: 0.2447206672789868
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since the advent of various pre-trained large language models, extracting structured knowledge from scientific text has experienced a revolutionary change compared with traditional machine learning or natural language processing techniques. Despite these advances, accessible automated tools that allow users to construct, validate, and visualise datasets from scientific literature extraction remain scarce. We therefore developed ComProScanner, an autonomous multi-agent platform that facilitates the extraction, validation, classification, and visualisation of machine-readable chemical compositions and properties, integrated with synthesis data from journal articles for comprehensive database creation. We evaluated our framework using 100 journal articles against 10 different LLMs, including both open-source and proprietary models, to extract highly complex compositions associated with ceramic piezoelectric materials and corresponding piezoelectric strain coefficients (d33), motivated by the lack of a large dataset for such materials. DeepSeek-V3-0324 outperformed all models with a significant overall accuracy of 0.82. This framework provides a simple, user-friendly, readily-usable package for extracting highly complex experimental data buried in the literature to build machine learning or deep learning datasets.
Related papers
- Exploring LLMs for Scientific Information Extraction Using The SciEx Framework [12.534492015126757]
Large language models (LLMs) are touted as powerful tools for automating scientific information extraction.<n>We present SciEx, a modular and composable framework that decouples key components including PDF parsing, multi-modal retrieval, extraction, and aggregation.<n>We evaluate SciEx on datasets spanning three scientific topics for its ability to extract fine-grained information accurately and consistently.
arXiv Detail & Related papers (2025-12-10T19:00:20Z) - LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature [60.879220305044726]
We propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data.<n>We curated 81k open-access papers, yielding LeMat- Synth (v 1.0): a dataset containing synthesis procedures spanning 35 synthesis methods and 16 material classes.<n>We release a modular, open-source library designed to support community-driven extension to new corpora and synthesis domains.
arXiv Detail & Related papers (2025-10-28T17:58:18Z) - Compressive Meta-Learning [49.300635370079874]
Compressive learning is a framework that enables efficient processing by using random, non-linear features.<n>We propose a framework that meta-learns both the encoding and decoding stages of compressive learning methods.<n>We explore multiple applications -- including neural network-based compressive PCA, compressive ridge regression, compressive k-means, and autoencoders.
arXiv Detail & Related papers (2025-08-14T22:08:06Z) - Enhanced Multi-Tuple Extraction for Alloys: Integrating Pointer Networks and Augmented Attention [6.938202451113495]
We present a novel framework that combines an extraction model based on MatSciBERT with pointer and an allocation model.<n>Our experiments on extraction demonstrate impressive F1 scores of 0.947, 0.93 and 0.753 across datasets.<n>These results highlight the model's capacity to deliver precise and structured information.
arXiv Detail & Related papers (2025-03-10T02:39:06Z) - Towards an automated workflow in materials science for combining multi-modal simulative and experimental information using data mining and large language models [0.0]
This manuscript showcases an automated workflow, which unravels the encoded information from scientific literature to a machine-readable database.<n>Ultimately, a Retrieval-Augmented Generation (RAG) based Large Language Model (LLM) enables a fast and efficient question answering chat bot.
arXiv Detail & Related papers (2025-02-18T16:24:46Z) - SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z) - Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z) - KAXAI: An Integrated Environment for Knowledge Analysis and Explainable
AI [0.0]
The paper describes the design of a system that integrates AutoML, XAI, and synthetic data generation.
The system allows users to navigate and harness the power of machine learning while abstracting its complexities and providing high usability.
arXiv Detail & Related papers (2023-12-30T10:20:47Z) - Agent-based Learning of Materials Datasets from Scientific Literature [0.0]
We develop a chemist AI agent, powered by large language models (LLMs), to create structured datasets from natural language text.
Our chemist AI agent, Eunomia, can plan and execute actions by leveraging the existing knowledge from decades of scientific research articles.
arXiv Detail & Related papers (2023-12-18T20:29:58Z) - Compositional Representation of Polymorphic Crystalline Materials [56.80318252233511]
We introduce PCRL, a novel approach that employs probabilistic modeling of composition to capture the diverse polymorphs from available structural information.<n>Extensive evaluations on sixteen datasets demonstrate the effectiveness of PCRL in learning compositional representation.
arXiv Detail & Related papers (2023-11-17T20:34:28Z) - Structured information extraction from complex scientific text with
fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts.
This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.