MatSciML: A Broad, Multi-Task Benchmark for Solid-State Materials Modeling
- URL: http://arxiv.org/abs/2309.05934v1
- Date: Tue, 12 Sep 2023 03:08:37 GMT
- Title: MatSciML: A Broad, Multi-Task Benchmark for Solid-State Materials Modeling
- Authors: Kin Long Kelvin Lee, Carmelo Gonzales, Marcel Nassar, Matthew Spellings, Mikhail Galkin, Santiago Miret
- Abstract summary: MatSci ML is a benchmark for modeling MATerials SCIence with Machine Learning (MatSci ML) methods, focused on solid-state materials with periodic crystal structures.
MatSci ML provides a diverse set of materials systems and properties data for model training and evaluation.
In the multi-dataset learning setting, MatSci ML enables researchers to combine observations from multiple datasets to perform joint prediction of common properties.
- Score: 7.142619575624596
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose MatSci ML, a novel benchmark for modeling MATerials SCIence using
Machine Learning (MatSci ML) methods focused on solid-state materials with
periodic crystal structures. Applying machine learning methods to solid-state
materials is a nascent field with substantial fragmentation largely driven by
the great variety of datasets used to develop machine learning models. This
fragmentation makes comparing the performance and generalizability of different
methods difficult, thereby hindering overall research progress in the field.
Building on top of open-source datasets, including large-scale datasets like
the OpenCatalyst, OQMD, NOMAD, the Carolina Materials Database, and Materials
Project, the MatSci ML benchmark provides a diverse set of materials systems
and properties data for model training and evaluation, including simulated
energies, atomic forces, material bandgaps, as well as classification data for
crystal symmetries via space groups. The diversity of properties in MatSci ML
makes the implementation and evaluation of multi-task learning algorithms for
solid-state materials possible, while the diversity of datasets facilitates the
development of new, more generalized algorithms and methods across multiple
datasets. In the multi-dataset learning setting, MatSci ML enables researchers
to combine observations from multiple datasets to perform joint prediction of
common properties, such as energy and forces. Using MatSci ML, we evaluate the
performance of different graph neural networks and equivariant point cloud
networks on several benchmark tasks spanning single task, multitask, and
multi-data learning scenarios. Our open-source code is available at
https://github.com/IntelLabs/matsciml.
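The multi-dataset setting described above, where samples from different source datasets share some labels (such as energy) but lack others (such as bandgap), can be sketched as a loss that only scores the targets each sample actually provides. This is a minimal illustrative sketch, not the matsciml API; all field names are hypothetical.

```python
# Sketch of multi-dataset joint training: samples from different source
# datasets share some properties (e.g. energy) and lack others (e.g. bandgap).
# Field names are hypothetical assumptions, not the matsciml API.

def joint_loss(prediction, targets, weights=None):
    """Sum squared error over only the targets this sample provides."""
    weights = weights or {}
    total = 0.0
    for key, true_val in targets.items():
        if true_val is None:  # property missing in this source dataset
            continue
        w = weights.get(key, 1.0)
        total += w * (prediction[key] - true_val) ** 2
    return total

# Two samples from different datasets: both carry energy, one lacks bandgap.
sample_a = {"energy": -3.2, "bandgap": 1.1}
sample_b = {"energy": -4.0, "bandgap": None}

pred = {"energy": -3.5, "bandgap": 1.0}
loss_a = joint_loss(pred, sample_a)  # scored on both properties
loss_b = joint_loss(pred, sample_b)  # scored only on the shared property
```

Masking missing targets this way is what lets observations from heterogeneous datasets contribute to a single training signal for the properties they have in common.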
Related papers
- Multi-Task Multi-Fidelity Learning of Properties for Energetic Materials [34.8008617873679]
We find that multi-task neural networks can learn from multi-modal data and outperform single-task models trained for specific properties.
As expected, the improvement is more significant for data-scarce properties.
This approach is widely applicable to fields outside energetic materials.
arXiv Detail & Related papers (2024-08-21T12:54:26Z)
- Advancing Multimodal Large Language Models in Chart Question Answering with Visualization-Referenced Instruction Tuning [1.6570772838074355]
Multimodal large language models (MLLMs) exhibit great potential for chart question answering (CQA).
Recent efforts primarily focus on scaling up training datasets through data collection and synthesis.
We propose a visualization-referenced instruction tuning approach to guide the training dataset enhancement and model development.
arXiv Detail & Related papers (2024-07-29T17:04:34Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- Multimodal Learning for Materials [7.167520424757711]
We introduce Multimodal Learning for Materials (MultiMat), which enables self-supervised multi-modality training of foundation models for materials.
We demonstrate our framework's potential using data from the Materials Project database on multiple axes.
arXiv Detail & Related papers (2023-11-30T18:35:29Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
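The Adaptive Bad-case Sampling idea summarized above, adjusting the ratio of data types toward those the model handles poorly, can be sketched as normalizing per-type error rates into sampling probabilities. The function name, data-type labels, and normalization scheme are illustrative assumptions, not the paper's implementation.

```python
# Sketch: allocate more newly generated data to the types the model
# gets wrong most often. Names and normalization are assumptions.

def adapt_sampling_ratios(error_rates):
    """Map per-type error rates to sampling ratios that sum to 1;
    higher error rate means a larger share of new data."""
    total = sum(error_rates.values())
    if total == 0:  # no errors observed: fall back to a uniform split
        n = len(error_rates)
        return {k: 1.0 / n for k in error_rates}
    return {k: v / total for k, v in error_rates.items()}

rates = {"captioning": 0.1, "reasoning": 0.3, "grounding": 0.6}
ratios = adapt_sampling_ratios(rates)
# "grounding" now receives the largest share of generated data
```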
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- PyHard: a novel tool for generating hardness embeddings to support data-centric analysis [0.38233569758620045]
PyHard produces a hardness embedding of a dataset relating the predictive performance of multiple ML models.
The user can interact with this embedding in multiple ways to obtain useful insights about data and algorithmic performance.
We show in a COVID prognosis dataset how this analysis supported the identification of pockets of hard observations that challenge ML models.
arXiv Detail & Related papers (2021-09-29T14:08:26Z)
- Memory-Based Optimization Methods for Model-Agnostic Meta-Learning and Personalized Federated Learning [56.17603785248675]
Model-agnostic meta-learning (MAML) has become a popular research area.
Existing MAML algorithms rely on the 'episode' idea by sampling a few tasks and data points to update the meta-model at each iteration.
This paper proposes memory-based algorithms for MAML that converge with vanishing error.
arXiv Detail & Related papers (2021-06-09T08:47:58Z)
- Intelligent multiscale simulation based on process-guided composite database [0.0]
We present an integrated data-driven modeling framework based on process modeling, material homogenization, and machine learning.
We are interested in the injection-molded short fiber reinforced composites, which have been identified as key material systems in automotive, aerospace, and electronics industries.
arXiv Detail & Related papers (2020-03-20T20:39:19Z)
- Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in the IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)
- Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim.
We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting.
Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all content) and is not responsible for any consequences arising from its use.