Related papers: MatScIE: An automated tool for the generation of databases of methods and parameters used in the computational materials science literature

MatScIE: An automated tool for the generation of databases of methods and parameters used in the computational materials science literature

URL: http://arxiv.org/abs/2009.06819v2
Date: Sat, 23 Jan 2021 03:30:55 GMT
Title: MatScIE: An automated tool for the generation of databases of methods and parameters used in the computational materials science literature
Authors: Souradip Guha, Ankan Mullick, Jatin Agrawal, Swetarekha Ram, Samir Ghui, Seung-Cheol Lee, Satadeep Bhattacharjee, Pawan Goyal
Abstract summary: MatScIE can extract relevant information from material science literature and make a structured database. Users can upload published articles and view/download the information obtained from this tool.
Score: 5.217605474243695
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The number of published articles in the field of materials science is growing rapidly every year. This comparatively unstructured data source, which contains a large amount of information, has a restriction on its re-usability, as the information needed to carry out further calculations using the data in it must be extracted manually. It is very important to obtain valid and contextually correct information from the online (offline) data, as it can be useful not only to generate inputs for further calculations, but also to incorporate them into a querying framework. Retaining this context as a priority, we have developed an automated tool, MatScIE (Material Scince Information Extractor) that can extract relevant information from material science literature and make a structured database that is much easier to use for material simulations. Specifically, we extract the material details, methods, code, parameters, and structure from the various research articles. Finally, we created a web application where users can upload published articles and view/download the information obtained from this tool and can create their own databases for their personal uses.

Related papers

LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature [60.879220305044726]
We propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data.<n>We curated 81k open-access papers, yielding LeMat- Synth (v 1.0): a dataset containing synthesis procedures spanning 35 synthesis methods and 16 material classes.<n>We release a modular, open-source library designed to support community-driven extension to new corpora and synthesis domains.
arXiv Detail & Related papers (2025-10-28T17:58:18Z)
Who Gets Cited Most? Benchmarking Long-Context Language Models on Scientific Articles [81.89404347890662]
SciTrek is a novel question-answering benchmark designed to evaluate the long-context reasoning capabilities of large language models (LLMs) using scientific articles.<n>Our analysis reveals systematic shortcomings in models' abilities to perform basic numerical operations and accurately locate specific information in long contexts.
arXiv Detail & Related papers (2025-09-25T11:36:09Z)
Materials Generation in the Era of Artificial Intelligence: A Comprehensive Survey [54.40267149907223]
Materials are the foundation of modern society, underpinning advancements in energy, electronics, healthcare, transportation, and infrastructure.<n>The ability to discover and design new materials with tailored properties is critical to solving some of the most pressing global challenges.<n>Data-driven generative models provide a powerful tool for materials design by directly create novel materials that satisfy predefined property requirements.
arXiv Detail & Related papers (2025-05-22T08:33:21Z)
Towards an automated workflow in materials science for combining multi-modal simulative and experimental information using data mining and large language models [0.0]
This manuscript showcases an automated workflow, which unravels the encoded information from scientific literature to a machine-readable database. Ultimately, a Retrieval-Augmented Generation (RAG) based Large Language Model (LLM) enables a fast and efficient question answering chat bot.
arXiv Detail & Related papers (2025-02-18T16:24:46Z)
SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles. Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets. We use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks.
arXiv Detail & Related papers (2024-09-03T17:54:40Z)
From Text to Insight: Large Language Models for Materials Science Data Extraction [4.08853418443192]
The vast majority of materials science knowledge exists in unstructured natural language. Structured data is crucial for innovative and systematic materials design. The advent of large language models (LLMs) represents a significant shift.
arXiv Detail & Related papers (2024-07-23T22:23:47Z)
Dynamic In-context Learning with Conversational Models for Data Extraction and Materials Property Prediction [0.0]
PropertyExtractor is an open-source tool that blends zero-shot with few-shot in-context learning. Our tests on material data demonstrate precision and recall that exceed 95% with an error rate of approximately 9%.
arXiv Detail & Related papers (2024-05-16T21:15:51Z)
Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [104.16648246740543]
We propose an efficient data collection method based on large language models. The method bootstraps seed information through a large language model and retrieves related data from public corpora. It not only collects knowledge-related data for specific domains but unearths the data with potential reasoning procedures.
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
Agent-based Learning of Materials Datasets from Scientific Literature [0.0]
We develop a chemist AI agent, powered by large language models (LLMs), to create structured datasets from natural language text. Our chemist AI agent, Eunomia, can plan and execute actions by leveraging the existing knowledge from decades of scientific research articles.
arXiv Detail & Related papers (2023-12-18T20:29:58Z)
Instruct and Extract: Instruction Tuning for On-Demand Information Extraction [86.29491354355356]
On-Demand Information Extraction aims to fulfill the personalized demands of real-world users. We present a benchmark named InstructIE, inclusive of both automatically generated training data, as well as the human-annotated test set. Building on InstructIE, we further develop an On-Demand Information Extractor, ODIE.
arXiv Detail & Related papers (2023-10-24T17:54:25Z)
Interactive Distillation of Large Single-Topic Corpora of Scientific Papers [1.2954493726326113]
A more robust but time-consuming approach is to build the dataset constructively in which a subject matter expert handpicks documents. Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature.
arXiv Detail & Related papers (2023-09-19T17:18:36Z)
Large Language Models as Master Key: Unlocking the Secrets of Materials Science with GPT [9.33544942080883]
This article presents a new natural language processing (NLP) task called structured information inference (SII) to address the complexities of information extraction at the device level in materials science. We accomplished this task by tuning GPT-3 on an existing perovskite solar cell FAIR dataset with 91.8% F1-score and extended the dataset with data published since its release. We also designed experiments to predict the electrical performance of solar cells and design materials or devices with targeted parameters using large language models (LLMs)
arXiv Detail & Related papers (2023-04-05T04:01:52Z)
CateCom: a practical data-centric approach to categorization of computational models [77.34726150561087]
We present an effort aimed at organizing the landscape of physics-based and data-driven computational models. We apply object-oriented design concepts and outline the foundations of an open-source collaborative framework.
arXiv Detail & Related papers (2021-09-28T02:59:40Z)
ENT-DESC: Entity Description Generation by Exploring Knowledge Graph [53.03778194567752]
In practice, the input knowledge could be more than enough, since the output description may only cover the most significant knowledge. We introduce a large-scale and challenging dataset to facilitate the study of such a practical scenario in KG-to-text. We propose a multi-graph structure that is able to represent the original graph information more comprehensively.
arXiv Detail & Related papers (2020-04-30T14:16:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.