Reconstructing Materials Tetrahedron: Challenges in Materials Information Extraction
- URL: http://arxiv.org/abs/2310.08383v3
- Date: Fri, 26 Apr 2024 19:12:13 GMT
- Title: Reconstructing Materials Tetrahedron: Challenges in Materials Information Extraction
- Authors: Kausik Hira, Mohd Zaki, Dhruvil Sheth, Mausam, N M Anoop Krishnan,
- Abstract summary: We discuss, quantify, and document challenges in automated information extraction from materials science literature.
This information is spread in multiple formats, such as tables, text, and images, and with little or no uniformity in reporting style.
We hope the present work inspires researchers to address the challenges in a coherent fashion, providing a fillip to IE towards developing a materials knowledge base.
- Score: 23.489721319567025
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The discovery of new materials has a documented history of propelling human progress for centuries and more. The behaviour of a material is a function of its composition, structure, and properties, which further depend on its processing and testing conditions. Recent developments in deep learning and natural language processing have enabled information extraction at scale from published literature such as peer-reviewed publications, books, and patents. However, this information is spread in multiple formats, such as tables, text, and images, and with little or no uniformity in reporting style giving rise to several machine learning challenges. Here, we discuss, quantify, and document these challenges in automated information extraction (IE) from materials science literature towards the creation of a large materials science knowledge base. Specifically, we focus on IE from text and tables and outline several challenges with examples. We hope the present work inspires researchers to address the challenges in a coherent fashion, providing a fillip to IE towards developing a materials knowledge base.
Related papers
- Polymetis:Large Language Modeling for Multiple Material Domains [11.396295878658924]
This paper proposes a large language model Polymetis model for a variety of materials fields.
The model uses a dataset of about 2 million material knowledge instructions, and in the process of building the dataset, we developed the Intelligent Extraction Large Model.
We inject this data into the GLM4-9B model for learning to enhance its inference capabilities in a variety of material domains.
arXiv Detail & Related papers (2024-11-13T16:10:14Z) - From Text to Insight: Large Language Models for Materials Science Data Extraction [4.08853418443192]
The vast majority of materials science knowledge exists in unstructured natural language.
Structured data is crucial for innovative and systematic materials design.
The advent of large language models (LLMs) represents a significant shift.
arXiv Detail & Related papers (2024-07-23T22:23:47Z) - Agent-based Learning of Materials Datasets from Scientific Literature [0.0]
We develop a chemist AI agent, powered by large language models (LLMs), to create structured datasets from natural language text.
Our chemist AI agent, Eunomia, can plan and execute actions by leveraging the existing knowledge from decades of scientific research articles.
arXiv Detail & Related papers (2023-12-18T20:29:58Z) - DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain
Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z) - Lessons in Reproducibility: Insights from NLP Studies in Materials
Science [4.205692673448206]
We aim to comprehend these studies from a perspective, acknowledging their significant influence on the field of materials informatics, rather than critiquing them.
Our study indicates that both papers offered thorough, tidy and well-documenteds, and clear guidance for model evaluation.
We highlight areas for improvement such as to provide access to training data where copyright restrictions permit, more transparency on model architecture and the training process, and specifications of software dependency versions.
arXiv Detail & Related papers (2023-07-28T18:36:42Z) - Artificial Intelligence in Concrete Materials: A Scientometric View [77.34726150561087]
This chapter aims to uncover the main research interests and knowledge structure of the existing literature on AI for concrete materials.
To begin with, a total of 389 journal articles published from 1990 to 2020 were retrieved from the Web of Science.
Scientometric tools such as keyword co-occurrence analysis and documentation co-citation analysis were adopted to quantify features and characteristics of the research field.
arXiv Detail & Related papers (2022-09-17T18:24:56Z) - Embedding Knowledge for Document Summarization: A Survey [66.76415502727802]
Previous works proved that knowledge-embedded document summarizers excel at generating superior digests.
We propose novel to recapitulate knowledge and knowledge embeddings under the document summarization view.
arXiv Detail & Related papers (2022-04-24T04:36:07Z) - TegTok: Augmenting Text Generation via Task-specific and Open-world
Knowledge [83.55215993730326]
We propose augmenting TExt Generation via Task-specific and Open-world Knowledge (TegTok) in a unified framework.
Our model selects knowledge entries from two types of knowledge sources through dense retrieval and then injects them into the input encoding and output decoding stages respectively.
arXiv Detail & Related papers (2022-03-16T10:37:59Z) - Knowledge-Aware Procedural Text Understanding with Multi-Stage Training [110.93934567725826]
We focus on the task of procedural text understanding, which aims to comprehend such documents and track entities' states and locations during a process.
Two challenges, the difficulty of commonsense reasoning and data insufficiency, still remain unsolved.
We propose a novel KnOwledge-Aware proceduraL text understAnding (KOALA) model, which effectively leverages multiple forms of external knowledge.
arXiv Detail & Related papers (2020-09-28T10:28:40Z) - The SOFC-Exp Corpus and Neural Approaches to Information Extraction in
the Materials Science Domain [11.085048329202335]
We develop an annotation scheme for marking information on experiments related to solid oxide fuel cells in scientific publications.
A corpus and an inter-annotator agreement study demonstrate the complexity of the suggested named entity recognition.
We present strong neural-network based models for a variety of tasks that can be addressed on the basis of our new data set.
arXiv Detail & Related papers (2020-06-04T17:49:34Z) - ENT-DESC: Entity Description Generation by Exploring Knowledge Graph [53.03778194567752]
In practice, the input knowledge could be more than enough, since the output description may only cover the most significant knowledge.
We introduce a large-scale and challenging dataset to facilitate the study of such a practical scenario in KG-to-text.
We propose a multi-graph structure that is able to represent the original graph information more comprehensively.
arXiv Detail & Related papers (2020-04-30T14:16:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.