MatSciML: A Broad, Multi-Task Benchmark for Solid-State Materials Modeling
- URL: http://arxiv.org/abs/2309.05934v1
- Date: Tue, 12 Sep 2023 03:08:37 GMT
- Title: MatSciML: A Broad, Multi-Task Benchmark for Solid-State Materials Modeling
- Authors: Kin Long Kelvin Lee, Carmelo Gonzales, Marcel Nassar, Matthew Spellings, Mikhail Galkin, Santiago Miret
- Abstract summary: MatSci ML is a benchmark for modeling MATerials SCIence using Machine Learning (MatSci ML) methods.
MatSci ML provides a diverse set of materials systems and properties data for model training and evaluation.
In the multi-dataset learning setting, MatSci ML enables researchers to combine observations from multiple datasets to perform joint prediction of common properties.
- Score: 7.142619575624596
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose MatSci ML, a novel benchmark for modeling MATerials SCIence using
Machine Learning (MatSci ML) methods focused on solid-state materials with
periodic crystal structures. Applying machine learning methods to solid-state
materials is a nascent field with substantial fragmentation largely driven by
the great variety of datasets used to develop machine learning models. This
fragmentation makes comparing the performance and generalizability of different
methods difficult, thereby hindering overall research progress in the field.
Building on top of open-source datasets, including large-scale datasets like
the OpenCatalyst, OQMD, NOMAD, the Carolina Materials Database, and Materials
Project, the MatSci ML benchmark provides a diverse set of materials systems
and properties data for model training and evaluation, including simulated
energies, atomic forces, material bandgaps, as well as classification data for
crystal symmetries via space groups. The diversity of properties in MatSci ML
makes the implementation and evaluation of multi-task learning algorithms for
solid-state materials possible, while the diversity of datasets facilitates the
development of new, more generalized algorithms and methods across multiple
datasets. In the multi-dataset learning setting, MatSci ML enables researchers
to combine observations from multiple datasets to perform joint prediction of
common properties, such as energy and forces. Using MatSci ML, we evaluate the
performance of different graph neural networks and equivariant point cloud
networks on several benchmark tasks spanning single task, multitask, and
multi-data learning scenarios. Our open-source code is available at
https://github.com/IntelLabs/matsciml.
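The multi-dataset and multi-task settings described in the abstract can be illustrated with a minimal sketch. It does not use the matsciml API; every name in it (`merge_datasets`, `masked_mse`, `multitask_loss`, the dataset keys, the toy values, and the task weights) is a hypothetical stand-in chosen for illustration. The assumption is that each dataset labels only some properties, so missing targets are skipped when a per-task loss is averaged.

```python
from typing import Dict, List, Optional, Sequence

# A sample maps property names to values; properties a dataset lacks are absent.
Sample = Dict[str, object]


def merge_datasets(datasets: Dict[str, List[Sample]]) -> List[Sample]:
    """Concatenate samples from several datasets, tagging each with its source."""
    merged: List[Sample] = []
    for name, samples in datasets.items():
        for sample in samples:
            merged.append({**sample, "source": name})
    return merged


def masked_mse(preds: Sequence[float], targets: Sequence[Optional[float]]) -> float:
    """Mean squared error over the samples that actually carry a label."""
    pairs = [(p, t) for p, t in zip(preds, targets) if t is not None]
    if not pairs:
        return 0.0  # no labels for this task in the batch
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs)


def multitask_loss(task_losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-task losses (weights are illustrative, not tuned)."""
    return sum(weights[task] * loss for task, loss in task_losses.items())


# Toy data: both hypothetical datasets share "energy"; only one has "band_gap".
merged = merge_datasets({
    "dataset_a": [{"energy": -1.0}],
    "dataset_b": [{"energy": -2.0, "band_gap": 1.5}],
})
energy_targets = [s.get("energy") for s in merged]
band_gap_targets = [s.get("band_gap") for s in merged]

# Placeholder predictions standing in for a model's outputs.
losses = {
    "energy": masked_mse([-1.5, -2.0], energy_targets),
    "band_gap": masked_mse([0.0, 1.0], band_gap_targets),
}
total = multitask_loss(losses, {"energy": 1.0, "band_gap": 0.5})
```

Masking missing labels is one simple way to let datasets with different property coverage contribute to a shared objective; the benchmark's own code may handle this differently.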
Related papers
- Physics Informed Machine Learning (PIML) methods for estimating the remaining useful lifetime (RUL) of aircraft engines [0.0]
This paper aims to use the newly developing field of physics informed machine learning (PIML) to develop models for predicting the remaining useful lifetime (RUL) of aircraft engines.
We consider the well-known benchmark NASA Commercial Modular Aero-Propulsion System Simulation System (C-MAPSS) data as the main data for this paper.
C-MAPSS is a well-studied dataset with much existing work in the literature that addresses RUL prediction with classical and deep learning methods.
arXiv Detail & Related papers (2024-06-21T19:55:34Z)
- Multimodal Learning for Materials [7.167520424757711]
We introduce Multimodal Learning for Materials (MultiMat), which enables self-supervised multi-modality training of foundation models for materials.
We demonstrate our framework's potential using data from the Materials Project database on multiple axes.
arXiv Detail & Related papers (2023-11-30T18:35:29Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Large Language Models as Master Key: Unlocking the Secrets of Materials Science with GPT [9.33544942080883]
This article presents a new natural language processing (NLP) task called structured information inference (SII) to address the complexities of information extraction at the device level in materials science.
We accomplished this task by tuning GPT-3 on an existing perovskite solar cell FAIR dataset with 91.8% F1-score and extended the dataset with data published since its release.
We also designed experiments to predict the electrical performance of solar cells and to design materials or devices with targeted parameters using large language models (LLMs).
arXiv Detail & Related papers (2023-04-05T04:01:52Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- PyHard: a novel tool for generating hardness embeddings to support data-centric analysis [0.38233569758620045]
PyHard produces a hardness embedding of a dataset relating the predictive performance of multiple ML models.
The user can interact with this embedding in multiple ways to obtain useful insights about data and algorithmic performance.
We show in a COVID prognosis dataset how this analysis supported the identification of pockets of hard observations that challenge ML models.
arXiv Detail & Related papers (2021-09-29T14:08:26Z)
- Memory-Based Optimization Methods for Model-Agnostic Meta-Learning and Personalized Federated Learning [56.17603785248675]
Model-agnostic meta-learning (MAML) has become a popular research area.
Existing MAML algorithms rely on the 'episode' idea by sampling a few tasks and data points to update the meta-model at each iteration.
This paper proposes memory-based algorithms for MAML that converge with vanishing error.
arXiv Detail & Related papers (2021-06-09T08:47:58Z)
- Intelligent multiscale simulation based on process-guided composite database [0.0]
We present an integrated data-driven modeling framework based on process modeling, material homogenization, and machine learning.
We are interested in injection-molded short fiber reinforced composites, which have been identified as key material systems in the automotive, aerospace, and electronics industries.
arXiv Detail & Related papers (2020-03-20T20:39:19Z)
- Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in the IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)
- Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim.
We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting.
Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)