1.5 million materials narratives generated by chatbots
- URL: http://arxiv.org/abs/2308.13687v1
- Date: Fri, 25 Aug 2023 22:00:53 GMT
- Title: 1.5 million materials narratives generated by chatbots
- Authors: Yang Jeong Park, Sung Eun Jerng, Jin-Sung Park, Choah Kwon, Chia-Wei
Hsu, Zhichu Ren, Sungroh Yoon, and Ju Li
- Abstract summary: We have generated a dataset of 1,494,017 natural language-material paragraphs based on combined OQMD, Materials Project, JARVIS, COD and AFLOW2 databases.
The generated text narratives were then polled and scored by both human experts and ChatGPT-4, based on three rubrics: technical accuracy, language and structure, and relevance and depth of content.
- Score: 25.125848842769464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of artificial intelligence (AI) has enabled a comprehensive
exploration of materials for various applications. However, AI models often
prioritize frequently encountered materials in the scientific literature,
limiting the selection of suitable candidates based on inherent physical and
chemical properties. To address this imbalance, we have generated a dataset of
1,494,017 natural language-material paragraphs based on combined OQMD,
Materials Project, JARVIS, COD and AFLOW2 databases, which are dominated by ab
initio calculations and tend to be much more evenly distributed on the periodic
table. The generated text narratives were then polled and scored by both human
experts and ChatGPT-4, based on three rubrics: technical accuracy, language and
structure, and relevance and depth of content, showing similar scores but with
human-scored depth of content lagging the most. The merging of
multi-modal data sources and large language models (LLMs) holds immense
potential for AI frameworks to aid the exploration and discovery of
solid-state materials for specific applications.
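The pipeline the abstract describes, rendering ab initio database records as narrative paragraphs and scoring them on three rubrics, can be sketched as follows. This is an illustrative sketch only: the record fields, narrative template, and 1-5 scoring scale are assumptions, not the paper's actual implementation.

```python
# Illustrative sketch (not the paper's code): turn a materials database
# record into a narrative paragraph and score it on three rubrics.
from dataclasses import dataclass
from statistics import mean

def narrate(record: dict) -> str:
    """Render a database record (hypothetical field names) as a short narrative."""
    return (
        f"{record['formula']} crystallizes in the {record['spacegroup']} space group "
        f"with a formation energy of {record['formation_energy_ev']} eV/atom and a "
        f"computed band gap of {record['band_gap_ev']} eV."
    )

@dataclass
class RubricScores:
    """The three rubrics from the paper; the 1-5 scale is an assumption."""
    technical_accuracy: float
    language_structure: float
    relevance_depth: float

    def overall(self) -> float:
        return mean([self.technical_accuracy,
                     self.language_structure,
                     self.relevance_depth])

record = {"formula": "NaCl", "spacegroup": "Fm-3m",
          "formation_energy_ev": -2.11, "band_gap_ev": 5.0}
print(narrate(record))
scores = RubricScores(4.5, 4.0, 3.5)  # e.g., one reviewer's ratings
print(f"overall: {scores.overall():.2f}")
```

In the paper, such scores were collected from both human experts and ChatGPT-4, which lets the two rater populations be compared rubric by rubric.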
Related papers
- From Tokens to Materials: Leveraging Language Models for Scientific Discovery [12.211984932142537]
This study investigates the application of language model embeddings to enhance material property prediction in materials science.
We demonstrate that domain-specific models, particularly MatBERT, significantly outperform general-purpose models in extracting implicit knowledge from compound names and material properties.
arXiv Detail & Related papers (2024-10-21T16:31:23Z)
- Knowledge-Aware Reasoning over Multimodal Semi-structured Tables [85.24395216111462]
This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data.
We introduce MMTabQA, a new dataset designed for this purpose.
Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs.
arXiv Detail & Related papers (2024-08-25T15:17:43Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- LongWanjuan: Towards Systematic Measurement for Long Text Quality [102.46517202896521]
LongWanjuan is a dataset of over 160B tokens specifically tailored to enhance the training of language models for long-text tasks.
In LongWanjuan, we categorize long texts into holistic, aggregated, and chaotic types, enabling a detailed analysis of long-text quality.
We devise a data mixture recipe that strategically balances different types of long texts within LongWanjuan, leading to significant improvements in model performance on long-text tasks.
arXiv Detail & Related papers (2024-02-21T07:27:18Z)
- Agent-based Learning of Materials Datasets from Scientific Literature [0.0]
We develop a chemist AI agent, powered by large language models (LLMs), to create structured datasets from natural language text.
Our chemist AI agent, Eunomia, can plan and execute actions by leveraging the existing knowledge from decades of scientific research articles.
arXiv Detail & Related papers (2023-12-18T20:29:58Z)
- Materials Expert-Artificial Intelligence for Materials Discovery [39.67752644916519]
We introduce "Materials Expert-Artificial Intelligence" (ME-AI) to encapsulate and articulate this human intuition.
The ME-AI learned descriptors independently reproduce expert intuition and expand upon it.
Our success points to the "machine bottling human insight" approach as promising for machine learning-aided material discovery.
arXiv Detail & Related papers (2023-12-05T14:29:18Z)
- MatChat: A Large Language Model and Application Service Platform for Materials Science [18.55541324347915]
We harness the power of the LLaMA2-7B model and enhance it through a learning process that incorporates 13,878 structured materials knowledge entries.
This specialized AI model, named MatChat, focuses on predicting inorganic material synthesis pathways.
MatChat is now accessible online and open for use, with both the model and its application framework available as open source.
arXiv Detail & Related papers (2023-10-11T05:11:46Z)
- Leveraging Language Representation for Material Recommendation, Ranking, and Exploration [0.0]
We introduce a material discovery framework that uses natural language embeddings derived from language models as representations of compositional and structural features.
By applying the framework to thermoelectrics, we demonstrate diversified recommendations of prototype structures and identify under-studied high-performance material spaces.
arXiv Detail & Related papers (2023-05-01T21:58:29Z)
- PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding.
However, LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge.
We describe the procedure for building a powerful, open-source language model specifically designed for medical applications, termed PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z)
- Artificial Intelligence in Concrete Materials: A Scientometric View [77.34726150561087]
This chapter aims to uncover the main research interests and knowledge structure of the existing literature on AI for concrete materials.
To begin with, a total of 389 journal articles published from 1990 to 2020 were retrieved from the Web of Science.
Scientometric tools such as keyword co-occurrence analysis and documentation co-citation analysis were adopted to quantify features and characteristics of the research field.
arXiv Detail & Related papers (2022-09-17T18:24:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.