MatScIE: An automated tool for the generation of databases of methods
and parameters used in the computational materials science literature
- URL: http://arxiv.org/abs/2009.06819v2
- Date: Sat, 23 Jan 2021 03:30:55 GMT
- Title: MatScIE: An automated tool for the generation of databases of methods
and parameters used in the computational materials science literature
- Authors: Souradip Guha, Ankan Mullick, Jatin Agrawal, Swetarekha Ram, Samir
Ghui, Seung-Cheol Lee, Satadeep Bhattacharjee, Pawan Goyal
- Abstract summary: MatScIE can extract relevant information from material science literature and make a structured database.
Users can upload published articles and view/download the information obtained from this tool.
- Score: 5.217605474243695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The number of published articles in the field of materials science is growing
rapidly every year. This comparatively unstructured data source contains a
large amount of information, but its reusability is limited because the
information needed to carry out further calculations must be extracted
manually. Obtaining valid and contextually correct information from online and
offline data is important, as it can be used not only to generate inputs for
further calculations but also to feed a querying framework. With this goal in
mind, we have
developed an automated tool, MatScIE (Material Science Information Extractor)
that can extract relevant information from material science literature and make
a structured database that is much easier to use for material simulations.
Specifically, we extract the material details, methods, code, parameters, and
structure from the various research articles. Finally, we created a web
application where users can upload published articles and view/download the
information obtained from this tool and can create their own databases for
their personal uses.
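
As a rough illustration of what a "structured database" entry could look like, the sketch below defines a hypothetical record with the five categories named in the abstract (material, method, code, parameters, structure) and fills part of it from one sentence. The record layout, the keyword lists, the field name `cutoff_energy_eV`, and the regex-based matching are all assumptions made for this example; the abstract does not describe how MatScIE performs extraction, and this is not the authors' implementation.

```python
import re
from dataclasses import dataclass, field, asdict


@dataclass
class ExtractionRecord:
    """Hypothetical record with the five categories named in the abstract."""
    material: list[str] = field(default_factory=list)
    method: list[str] = field(default_factory=list)
    code: list[str] = field(default_factory=list)
    parameters: dict[str, str] = field(default_factory=dict)
    structure: list[str] = field(default_factory=list)


# Toy vocabularies chosen only for illustration (assumptions, not from the paper).
KNOWN_CODES = ("VASP", "Quantum ESPRESSO")
KNOWN_METHODS = ("DFT", "GGA", "PBE")


def extract(sentence: str) -> ExtractionRecord:
    """Fill a record from one sentence using simple keyword and regex matching."""
    record = ExtractionRecord()
    # Simulation codes: plain substring lookup against the toy vocabulary.
    record.code = [c for c in KNOWN_CODES if c in sentence]
    # Methods: whole-word match so that short acronyms do not match inside words.
    record.method = [m for m in KNOWN_METHODS if re.search(rf"\b{m}\b", sentence)]
    # One example parameter: a plane-wave cutoff energy given in eV.
    match = re.search(r"cutoff(?: energy)? of (\d+(?:\.\d+)?)\s*eV", sentence)
    if match:
        record.parameters["cutoff_energy_eV"] = match.group(1)
    # Material and structure are left empty; this sketch does not attempt them.
    return record


sentence = (
    "DFT calculations were performed with VASP using the PBE functional "
    "and a cutoff energy of 500 eV."
)
record = extract(sentence)
print(asdict(record))
```

A record of this shape can be serialized with `asdict` and stored as one row per sentence or per article, which is one plausible way to make the extracted values queryable.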
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets.
We use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents.
We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks.
arXiv Detail & Related papers (2024-09-03T17:54:40Z)
- From Text to Insight: Large Language Models for Materials Science Data Extraction [4.08853418443192]
The vast majority of materials science knowledge exists in unstructured natural language.
Structured data is crucial for innovative and systematic materials design.
The advent of large language models (LLMs) represents a significant shift.
arXiv Detail & Related papers (2024-07-23T22:23:47Z)
- Dynamic In-context Learning with Conversational Models for Data Extraction and Materials Property Prediction [0.0]
PropertyExtractor is an open-source tool that blends zero-shot with few-shot in-context learning.
Our tests on material data demonstrate precision and recall that exceed 95% with an error rate of approximately 9%.
arXiv Detail & Related papers (2024-05-16T21:15:51Z)
- Query of CC: Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [104.16648246740543]
We propose an efficient data collection method based on large language models.
The method bootstraps seed information through a large language model and retrieves related data from public corpora.
It not only collects knowledge-related data for specific domains but unearths the data with potential reasoning procedures.
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
- Agent-based Learning of Materials Datasets from Scientific Literature [0.0]
We develop a chemist AI agent, powered by large language models (LLMs), to create structured datasets from natural language text.
Our chemist AI agent, Eunomia, can plan and execute actions by leveraging the existing knowledge from decades of scientific research articles.
arXiv Detail & Related papers (2023-12-18T20:29:58Z)
- Instruct and Extract: Instruction Tuning for On-Demand Information Extraction [86.29491354355356]
On-Demand Information Extraction aims to fulfill the personalized demands of real-world users.
We present a benchmark named InstructIE, inclusive of both automatically generated training data, as well as the human-annotated test set.
Building on InstructIE, we further develop an On-Demand Information Extractor, ODIE.
arXiv Detail & Related papers (2023-10-24T17:54:25Z)
- Interactive Distillation of Large Single-Topic Corpora of Scientific Papers [1.2954493726326113]
A more robust but time-consuming approach is to build the dataset constructively in which a subject matter expert handpicks documents.
Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature.
arXiv Detail & Related papers (2023-09-19T17:18:36Z)
- Large Language Models as Master Key: Unlocking the Secrets of Materials Science with GPT [9.33544942080883]
This article presents a new natural language processing (NLP) task called structured information inference (SII) to address the complexities of information extraction at the device level in materials science.
We accomplished this task by tuning GPT-3 on an existing perovskite solar cell FAIR dataset with 91.8% F1-score and extended the dataset with data published since its release.
We also designed experiments to predict the electrical performance of solar cells and design materials or devices with targeted parameters using large language models (LLMs).
arXiv Detail & Related papers (2023-04-05T04:01:52Z)
- CateCom: a practical data-centric approach to categorization of computational models [77.34726150561087]
We present an effort aimed at organizing the landscape of physics-based and data-driven computational models.
We apply object-oriented design concepts and outline the foundations of an open-source collaborative framework.
arXiv Detail & Related papers (2021-09-28T02:59:40Z)
- ENT-DESC: Entity Description Generation by Exploring Knowledge Graph [53.03778194567752]
In practice, the input knowledge could be more than enough, since the output description may only cover the most significant knowledge.
We introduce a large-scale and challenging dataset to facilitate the study of such a practical scenario in KG-to-text.
We propose a multi-graph structure that is able to represent the original graph information more comprehensively.
arXiv Detail & Related papers (2020-04-30T14:16:19Z)