The Open Catalyst 2022 (OC22) Dataset and Challenges for Oxide
Electrocatalysis
- URL: http://arxiv.org/abs/2206.08917v1
- Date: Fri, 17 Jun 2022 17:54:10 GMT
- Title: The Open Catalyst 2022 (OC22) Dataset and Challenges for Oxide
Electrocatalysis
- Authors: Richard Tran, Janice Lan, Muhammed Shuaibi, Siddharth Goyal, Brandon
M. Wood, Abhishek Das, Javier Heras-Domingo, Adeesh Kolluru, Ammar Rizvi,
Nima Shoghi, Anuroop Sriram, Zachary Ulissi, C. Lawrence Zitnick
- Abstract summary: A general machine learning potential that spans the chemical space of oxide materials is still out of reach.
The Open Catalyst 2022 (OC22) dataset consists of 62,521 Density Functional Theory (DFT) relaxations across a range of oxide materials.
We study whether combining datasets leads to better results, even if they contain different materials or adsorbates.
- Score: 9.9765107020148
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Computational catalysis and machine learning communities have made
considerable progress in developing machine learning models for catalyst
discovery and design. Yet, a general machine learning potential that spans the
chemical space of catalysis is still out of reach. A significant hurdle is
obtaining access to training data across a wide range of materials. One
important class of materials where data is lacking is oxides, which inhibits
models from studying the Oxygen Evolution Reaction and oxide electrocatalysis
more generally. To address this, we developed the Open Catalyst 2022 (OC22)
dataset, consisting of 62,521 Density Functional Theory (DFT) relaxations
(~9,884,504 single point calculations) across a range of oxide materials,
coverages, and adsorbates (*H, *O, *N, *C, *OOH, *OH, *OH2, *O2, *CO). We
define generalized tasks to predict the total system energy that are applicable
across catalysis, develop baseline performance of several graph neural networks
(SchNet, DimeNet++, ForceNet, SpinConv, PaiNN, GemNet-dT, GemNet-OC), and
provide pre-defined dataset splits to establish clear benchmarks for future
efforts. For all tasks, we study whether combining datasets leads to better
results, even if they contain different materials or adsorbates. Specifically,
we jointly train models on Open Catalyst 2020 (OC20) Dataset and OC22, or
fine-tune pretrained OC20 models on OC22. In the most general task, GemNet-OC
sees a ~32% improvement in energy predictions through fine-tuning and a ~9%
improvement in force predictions via joint training. Surprisingly, joint
training on both the OC20 and much smaller OC22 datasets also improves total
energy predictions on OC20 by ~19%. The dataset and baseline models are open
sourced, and a public leaderboard will follow to encourage continued community
developments on the total energy tasks and data.
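To make the fine-tuning recipe described in the abstract more concrete, the following is a minimal sketch, assuming a stand-in PyTorch model and random tensors in place of a real GNN (such as GemNet-OC) and the actual OC20/OC22 data pipeline. Every name in it (ToyPotential, finetune) is hypothetical and it is not the released OC22 code; it only illustrates the idea of continuing training on total-energy targets, with forces obtained as the negative gradient of the predicted energy.

# Hypothetical sketch of the OC20 -> OC22 fine-tuning recipe described in the
# abstract. A stand-in MLP replaces a real GNN (e.g. GemNet-OC), and random
# tensors replace the DFT relaxation data; only the training logic is meant
# to illustrate predicting total energies and per-atom forces.
import torch
import torch.nn as nn

class ToyPotential(nn.Module):
    """Maps flattened atomic positions to a scalar total energy."""
    def __init__(self, n_atoms: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * n_atoms, hidden), nn.SiLU(), nn.Linear(hidden, 1)
        )

    def forward(self, pos: torch.Tensor) -> torch.Tensor:
        return self.net(pos.flatten(start_dim=1)).squeeze(-1)

def finetune(model, positions, energies, forces,
             epochs=10, lr=1e-4, force_weight=100.0):
    """Continue training a 'pretrained' model on total-energy and force targets."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        pos = positions.clone().requires_grad_(True)
        e_pred = model(pos)
        # Forces are the negative gradient of the predicted total energy
        # with respect to atomic positions.
        f_pred = -torch.autograd.grad(e_pred.sum(), pos, create_graph=True)[0]
        loss = (nn.functional.l1_loss(e_pred, energies)
                + force_weight * nn.functional.l1_loss(f_pred, forces))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

if __name__ == "__main__":
    n_structs, n_atoms = 32, 12
    model = ToyPotential(n_atoms)             # pretend this was pretrained on OC20
    pos = torch.randn(n_structs, n_atoms, 3)  # stand-in for OC22 structures
    e = torch.randn(n_structs)                # stand-in total energies (eV)
    f = torch.randn(n_structs, n_atoms, 3)    # stand-in forces (eV/A)
    finetune(model, pos, e, f)

In the actual work, the pretrained weights come from OC20 training and the targets come from the OC22 DFT relaxations; the sketch keeps only the structure of that workflow.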
Related papers
- Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models [3.865029260331255]
We present a Meta FAIR release of the Open Materials 2024 (OMat24) large-scale open dataset and an accompanying set of pre-trained models.
OMat24 contains over 110 million density functional theory (DFT) calculations focused on structural and compositional diversity.
Our EquiformerV2 models achieve state-of-the-art performance on the Matbench Discovery leaderboard.
arXiv Detail & Related papers (2024-10-16T17:48:34Z)
- Dumpling GNN: Hybrid GNN Enables Better ADC Payload Activity Prediction Based on Chemical Structure [53.76752789814785]
DumplingGNN is a hybrid Graph Neural Network architecture specifically designed for predicting ADC payload activity based on chemical structure.
We evaluate it on a comprehensive ADC payload dataset focusing on DNA Topoisomerase I inhibitors.
It demonstrates exceptional accuracy (91.48%), sensitivity (95.08%), and specificity (97.54%) on our specialized ADC payload dataset.
arXiv Detail & Related papers (2024-09-23T17:11:04Z)
- On the importance of catalyst-adsorbate 3D interactions for relaxed energy predictions [98.70797778496366]
We investigate whether it is possible to predict a system's relaxed energy in the OC20 dataset while ignoring the relative position of the adsorbate.
We find that while removing binding site information impairs accuracy as expected, modified models are able to predict relaxed energies with remarkably decent MAE.
arXiv Detail & Related papers (2023-10-10T14:57:04Z)
- Activity Cliff Prediction: Dataset and Benchmark [20.41770222873952]
We first introduce ACNet, a large-scale dataset for AC prediction.
ACNet curates over 400K Matched Molecular Pairs (MMPs) against 190 targets.
We propose a baseline framework to benchmark the predictive performance of molecular representations encoded by deep neural networks for AC prediction.
arXiv Detail & Related papers (2023-02-15T09:19:07Z)
- PhAST: Physics-Aware, Scalable, and Task-specific GNNs for Accelerated Catalyst Design [102.9593507372373]
Catalyst materials play a crucial role in the electrochemical reactions involved in industrial processes.
Machine learning holds the potential to efficiently model materials properties from large amounts of data.
We propose task-specific innovations applicable to most architectures, enhancing both computational efficiency and accuracy.
arXiv Detail & Related papers (2022-11-22T05:24:30Z)
- How Do Graph Networks Generalize to Large and Diverse Molecular Systems? [10.690849483282564]
We identify four aspects of complexity in which many datasets are lacking.
We propose the GemNet-OC model, which outperforms the previous state-of-the-art on OC20 by 16%.
Our findings challenge the common belief that graph neural networks work equally well independent of dataset size and diversity.
arXiv Detail & Related papers (2022-04-06T12:52:34Z)
- An Empirical Study of Graphormer on Large-Scale Molecular Modeling Datasets [87.00711479972503]
"Graphormer-V2" could attain better results on large-scale molecular modeling datasets than the vanilla one.
With a global receptive field and an adaptive aggregation strategy, Graphormer is more powerful than classic message-passing-based GNNs.
arXiv Detail & Related papers (2022-02-28T16:32:42Z)
- Accelerating COVID-19 research with graph mining and transformer-based learning [2.493740042317776]
We present AGATHA-C and AGATHA-GP, automated general-purpose hypothesis generation systems for COVID-19 research.
Both systems achieve high-quality predictions across domains (in some domains up to 0.97 ROC AUC) with fast computational times.
We show that the systems are able to discover on-going research findings such as the relationship between COVID-19 and oxytocin hormone.
arXiv Detail & Related papers (2021-02-10T15:11:36Z)
- Unassisted Noise Reduction of Chemical Reaction Data Sets [59.127921057012564]
We propose a machine learning-based, unassisted approach to remove chemically wrong entries from data sets.
Our results show an improved prediction quality for models trained on the cleaned and balanced data sets.
arXiv Detail & Related papers (2021-02-02T09:34:34Z)
- The Open Catalyst 2020 (OC20) Dataset and Community Challenges [36.556154866045894]
Catalyst discovery and optimization is key to solving many societal and energy challenges.
It remains an open challenge to build models that can generalize across both elemental compositions of surfaces and adsorbates.
arXiv Detail & Related papers (2020-10-20T03:29:18Z)
- Assessing Graph-based Deep Learning Models for Predicting Flash Point [52.931492216239995]
Graph-based deep learning (GBDL) models were applied to flash point prediction for the first time.
Average R2 and Mean Absolute Error (MAE) scores of MPNN are, respectively, 2.3% lower and 2.0 K higher than those of previous comparable studies.
arXiv Detail & Related papers (2020-02-26T06:10:12Z)