Omics-scale polymer computational database transferable to real-world artificial intelligence applications
- URL: http://arxiv.org/abs/2511.11626v1
- Date: Fri, 07 Nov 2025 09:03:07 GMT
- Title: Omics-scale polymer computational database transferable to real-world artificial intelligence applications
- Authors: Ryo Yoshida, Yoshihiro Hayashi, Hidemine Furuya, Ryohei Hosoya, Kazuyoshi Kaneko, Hiroki Sugisawa, Yu Kaneko, Aiko Takahashi, Yoh Noguchi, Shun Nanjo, Keiko Shinoda, Tomu Hamakawa, Mitsuru Ohno, Takuya Kitamura, Misaki Yonekawa, Stephen Wu, Masato Ohnishi, Chang Liu, Teruki Tsurimoto, Arifin, Araki Wakiuchi, Kohei Noda, Junko Morikawa, Teruaki Hayakawa, Junichiro Shiomi, Masanobu Naito, Kazuya Shiratori, Tomoki Nagai, Norio Tomotsu, Hiroto Inoue, Ryuichi Sakashita, Masashi Ishii, Isao Kuwajima, Kenji Furuichi, Norihiko Hiroi, Yuki Takemoto, Takahiro Ohkuma, Keita Yamamoto, Naoya Kowatari, Masato Suzuki, Naoya Matsumoto, Seiryu Umetani, Hisaki Ikebata, Yasuyuki Shudo, Mayu Nagao, Shinya Kamada, Kazunori Kamio, Taichi Shomura, Kensaku Nakamura, Yudai Iwamizu, Atsutoshi Abe, Koki Yoshitomi, Yuki Horie, Katsuhiko Koike, Koichi Iwakabe, Shinya Gima, Kota Usui, Gikyo Usuki, Takuro Tsutsumi, Keitaro Matsuoka, Kazuki Sada, Masahiro Kitabata, Takuma Kikutsuji, Akitaka Kamauchi, Yusuke Iijima, Tsubasa Suzuki, Takenori Goda, Yuki Takabayashi, Kazuko Imai, Yuji Mochizuki, Hideo Doi, Koji Okuwaki, Hiroya Nitta, Taku Ozawa, Hitoshi Kamijima, Toshiaki Shintani, Takuma Mitamura, Massimiliano Zamengo, Yuitsu Sugami, Seiji Akiyama, Yoshinari Murakami, Atsushi Betto, Naoya Matsuo, Satoru Kagao, Tetsuya Kobayashi, Norie Matsubara, Shosei Kubo, Yuki Ishiyama, Yuri Ichioka, Mamoru Usami, Satoru Yoshizaki, Seigo Mizutani, Yosuke Hanawa, Shogo Kunieda, Mitsuru Yambe, Takeru Nakamura, Hiromori Murashima, Kenji Takahashi, Naoki Wada, Masahiro Kawano, Yosuke Harada, Takehiro Fujita, Erina Fujita, Ryoji Himeno, Hiori Kino, Kenji Fukumizu,
- Abstract summary: PolyOmics is an omics-scale computational database generated through fully automated molecular dynamics simulation pipelines. Machine learning models pretrained on PolyOmics can be efficiently fine-tuned for a wide range of real-world downstream tasks.
- Score: 8.718893022299653
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Developing large-scale foundational datasets is a critical milestone in advancing artificial intelligence (AI)-driven scientific innovation. However, unlike AI-mature fields such as natural language processing, materials science, particularly polymer research, has significantly lagged in developing extensive open datasets. This lag is primarily due to the high costs of polymer synthesis and property measurements, along with the vastness and complexity of the chemical space. This study presents PolyOmics, an omics-scale computational database generated through fully automated molecular dynamics simulation pipelines that provide diverse physical properties for over $10^5$ polymeric materials. The PolyOmics database is collaboratively developed by approximately 260 researchers from 48 institutions to bridge the gap between academia and industry. Machine learning models pretrained on PolyOmics can be efficiently fine-tuned for a wide range of real-world downstream tasks, even when only limited experimental data are available. Notably, the generalisation capability of these simulation-to-real transfer models improves significantly as the size of the PolyOmics database increases, exhibiting power-law scaling. The emergence of scaling laws supports the "more is better" principle, highlighting the significance of ultralarge-scale computational materials data for improving real-world prediction performance. This unprecedented omics-scale database reveals vast unexplored regions of polymer materials, providing a foundation for AI-driven polymer science.
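The abstract reports power-law scaling of generalisation with database size but gives no functional form here. As a minimal sketch, assuming the simplest form, error ≈ a · n^(-alpha) with no irreducible-error offset, such an exponent could be estimated by a straight-line fit in log-log space. The function name `fit_power_law` and all numbers below are hypothetical and synthetic; none are taken from the paper.

```python
import numpy as np


def fit_power_law(n, err):
    """Fit err ~ a * n**(-alpha) by least squares in log-log space.

    Assumes a pure power law (no constant offset); returns (a, alpha).
    """
    # log(err) = log(a) - alpha * log(n), so a degree-1 fit recovers both.
    slope, intercept = np.polyfit(np.log(n), np.log(err), deg=1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic, illustrative values only (not results from the paper).
sizes = np.array([1e2, 1e3, 1e4, 1e5])
errors = 2.0 * sizes ** -0.25

a, alpha = fit_power_law(sizes, errors)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

With real learning curves, a larger fitted alpha would indicate that prediction error falls faster as the pretraining set grows; a curve that flattens at large n would call for a form with an added offset term instead.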
Related papers
- Open Polymer Challenge: Post-Competition Report [34.36687017237976]
The Open Polymer Challenge (OPC) releases the first community-developed benchmark for polymer informatics. The challenge centers on multi-task polymer property prediction, a core step in virtual screening pipelines for materials discovery. We release the test dataset at https://www.kaggle.com/datasets/alexliu99/neurips-open-polymer-prediction-2025-test-data.
arXiv Detail & Related papers (2025-12-09T18:38:15Z)
- Foundation Models for Discovery and Exploration in Chemical Space [57.97784111110166]
MIST is a family of molecular foundation models trained on large unlabeled datasets. We demonstrate the ability of these models to solve real-world problems across chemical space.
arXiv Detail & Related papers (2025-10-20T17:56:01Z)
- A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers [251.23085679210206]
Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research. This survey reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge.
arXiv Detail & Related papers (2025-08-28T18:30:52Z)
- POINT$^{2}$: A Polymer Informatics Training and Testing Database [15.45788515943579]
POINT$^{2}$ (POlymer INformatics Training and Testing) is a benchmark database and protocol designed to address critical challenges in polymer informatics. We develop an ensemble of ML models, including Quantile Random Forests, Multilayer Perceptrons with dropout, Graph Neural Networks, and pretrained large language models. These models are coupled with diverse polymer representations such as Morgan, MACCS, RDKit, Topological, Atom Pair fingerprints, and graph-based descriptors.
arXiv Detail & Related papers (2025-03-30T15:46:01Z)
- Multimodal machine learning with large language embedding model for polymer property prediction [2.525624865489335]
We propose a simple yet effective multimodal architecture, PolyLLMem, for polymer property prediction tasks. PolyLLMem integrates text embeddings generated by Llama 3 with molecular structure embeddings derived from Uni-Mol. Its performance is comparable to, and in some cases exceeds, that of graph-based models, as well as transformer-based models.
arXiv Detail & Related papers (2025-03-29T03:48:11Z)
- DARWIN 1.5: Large Language Models as Materials Science Adapted Learners [46.7259033847682]
We propose DARWIN 1.5, the largest open-source large language model tailored for materials science. DARWIN eliminates the need for task-specific descriptors and enables a flexible, unified approach to material property prediction and discovery. Our approach integrates 6M material domain papers and 21 experimental datasets from 49,256 materials across modalities while enabling cross-task knowledge transfer.
arXiv Detail & Related papers (2024-12-16T16:51:27Z)
- Transferring a molecular foundation model for polymer property predictions [3.067983186439152]
Self-supervised pretraining of transformer models requires large-scale datasets.
We show that transformers pretrained on small molecules and fine-tuned on polymer properties achieve accuracy comparable to those trained on augmented polymer datasets.
arXiv Detail & Related papers (2023-10-25T19:55:00Z)
- Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction [53.06671763877109]
We develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction.
Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations.
On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance.
arXiv Detail & Related papers (2023-02-04T01:32:40Z)
- TransPolymer: a Transformer-based language model for polymer property predictions [9.04563945965023]
TransPolymer is a Transformer-based language model for polymer property prediction.
Our proposed polymer tokenizer with chemical awareness enables learning representations from polymer sequences.
arXiv Detail & Related papers (2022-09-03T01:29:59Z)
- Accurate Machine Learned Quantum-Mechanical Force Fields for Biomolecular Simulations [51.68332623405432]
Molecular dynamics (MD) simulations allow atomistic insights into chemical and biological processes.
Recently, machine learned force fields (MLFFs) emerged as an alternative means to execute MD simulations.
This work proposes a general approach to constructing accurate MLFFs for large-scale molecular simulations.
arXiv Detail & Related papers (2022-05-17T13:08:28Z)
- BIGDML: Towards Exact Machine Learning Force Fields for Materials [55.944221055171276]
Machine-learning force fields (MLFF) should be accurate, computationally and data efficient, and applicable to molecules, materials, and interfaces thereof.
Here, we introduce the Bravais-Inspired Gradient-Domain Machine Learning approach and demonstrate its ability to construct reliable force fields using a training set with just 10-200 atoms.
arXiv Detail & Related papers (2021-06-08T10:14:57Z)
- Polymer Informatics: Current Status and Critical Next Steps [1.3238373064156097]
Surrogate models are trained on available polymer data for instant property prediction.
Data-driven strategies to tackle unique challenges resulting from the extraordinary chemical and physical diversity of polymers at small and large scales are being explored.
Methods to solve inverse problems, wherein polymer recommendations are made using advanced AI algorithms that meet application targets, are being investigated.
arXiv Detail & Related papers (2020-11-01T14:17:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.