Accelerating Materials Discovery: Learning a Universal Representation of Chemical Processes for Cross-Domain Property Prediction
- URL: http://arxiv.org/abs/2512.05979v1
- Date: Wed, 26 Nov 2025 12:19:14 GMT
- Title: Accelerating Materials Discovery: Learning a Universal Representation of Chemical Processes for Cross-Domain Property Prediction
- Authors: Mikhail Tsitsvero, Atsuyuki Nakao, Hisaki Ikebata
- Abstract summary: We introduce a universal directed-tree process-graph representation that unifies unstructured text, molecular structures, and numeric measurements into a single machine-readable format. Trained on approximately 700,000 process graphs from nearly 9,000 diverse documents, our model learns semantically rich embeddings that generalize across domains.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Experimental validation of chemical processes is slow and costly, limiting exploration in materials discovery. Machine learning can prioritize promising candidates, but existing data in patents and literature is heterogeneous and difficult to use. We introduce a universal directed-tree process-graph representation that unifies unstructured text, molecular structures, and numeric measurements into a single machine-readable format. To learn from this structured data, we developed a multi-modal graph neural network with a property-conditioned attention mechanism. Trained on approximately 700,000 process graphs from nearly 9,000 diverse documents, our model learns semantically rich embeddings that generalize across domains. When fine-tuned on compact, domain-specific datasets, the pretrained model achieves strong performance, demonstrating that universal process representations learned at scale transfer effectively to specialized prediction tasks with minimal additional data.
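The abstract describes a directed-tree process-graph format but does not specify its schema here; the following is a minimal, hypothetical sketch of what such a tree might look like, with illustrative node kinds ("text", "molecule", "measurement") and field names that are assumptions, not the paper's actual representation.

```python
from dataclasses import dataclass, field

# Hypothetical sketch only: node kinds and payload conventions are
# illustrative, not taken from the paper's actual schema.
@dataclass
class ProcessNode:
    kind: str                 # "text" | "molecule" | "measurement"
    payload: object           # free text, a SMILES string, or (value, unit)
    children: list = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child

def flatten(node, depth=0):
    """Depth-first serialization of the directed tree into (depth, kind, payload) rows."""
    rows = [(depth, node.kind, node.payload)]
    for c in node.children:
        rows.extend(flatten(c, depth + 1))
    return rows

# Tiny example process: dissolve a monomer, heat, then record a measurement.
root = ProcessNode("text", "dissolve monomer in toluene")
root.add(ProcessNode("molecule", "Cc1ccccc1"))              # toluene, as SMILES
heat = root.add(ProcessNode("text", "heat to 80 C for 2 h"))
heat.add(ProcessNode("measurement", (0.87, "yield fraction")))

rows = flatten(root)
print(rows)
```

The point of such a flattening is that heterogeneous steps (free text, structures, measurements) become a single ordered sequence that a graph neural network or sequence model can consume; how the paper actually encodes each modality is not stated in this summary.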
Related papers
- Generate, Transfer, Adapt: Learning Functional Dexterous Grasping from a Single Human Demonstration [23.251258563998253]
We present CorDex, a framework that robustly learns dexterous functional grasps of novel objects from synthetic data.
At the core of our approach is a correspondence-based data engine that generates diverse, high-quality training data in simulation.
Building on the generated data, we introduce a multimodal prediction network that integrates visual and geometric information.
arXiv Detail & Related papers (2026-01-08T18:59:30Z)
- Molecular Machine Learning Using Euler Characteristic Transforms [12.108680020079925]
The shape of a molecule determines its physicochemical and biological properties.
We propose using the Euler Characteristic Transform (ECT) as a geometrical-topological descriptor.
The ECT enables the extraction of multiscale structural features, offering a novel way to represent and encode molecular shape in the feature space.
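As a toy illustration of the invariant underlying the ECT: the Euler characteristic of a simplicial complex is the alternating count of its simplices, chi = #vertices - #edges + #triangles. The ECT itself tracks how this number changes over sublevel sets along many directions; the snippet below shows only the basic count, with a hand-built example complex.

```python
# Toy illustration of the Euler characteristic the ECT is built on:
# chi = (#vertices) - (#edges) + (#triangles) for a small simplicial complex.
def euler_characteristic(vertices, edges, triangles):
    return len(vertices) - len(edges) + len(triangles)

# A filled triangle (2-simplex): 3 vertices, 3 edges, 1 face -> chi = 1.
V = [0, 1, 2]
E = [(0, 1), (1, 2), (0, 2)]
T = [(0, 1, 2)]
print(euler_characteristic(V, E, T))   # 1, same as a disk

# A hollow triangle (drop the face): chi = 0, same as a circle.
print(euler_characteristic(V, E, []))  # 0
```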
arXiv Detail & Related papers (2025-07-04T10:57:40Z)
- A Large Encoder-Decoder Family of Foundation Models For Chemical Language [1.1073864511426255]
This paper introduces a large encoder-decoder family of chemical foundation models pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem.
Our experiments across multiple benchmark datasets validate the capacity of the proposed model in providing state-of-the-art results for different tasks.
arXiv Detail & Related papers (2024-07-24T20:30:39Z)
- Structure to Property: Chemical Element Embeddings and a Deep Learning Approach for Accurate Prediction of Chemical Properties [0.0]
We introduce the elEmBERT model for chemical classification tasks.
It is based on deep learning techniques, such as a multilayer encoder architecture.
We demonstrate the opportunities offered by our approach on sets of organic, inorganic, and crystalline compounds.
arXiv Detail & Related papers (2023-09-17T19:41:32Z)
- Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular Property Prediction [53.06671763877109]
We develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction.
Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations.
On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance.
arXiv Detail & Related papers (2023-02-04T01:32:40Z)
- CHALLENGER: Training with Attribution Maps [63.736435657236505]
We show that utilizing attribution maps for training neural networks can improve the regularization of models and thus increase performance.
In particular, we show that our generic domain-independent approach yields state-of-the-art results in vision, natural language processing and on time series tasks.
arXiv Detail & Related papers (2022-05-30T13:34:46Z)
- Digital Fingerprinting of Microstructures [44.139970905896504]
Finding efficient means of fingerprinting microstructural information is a critical step towards harnessing data-centric machine learning approaches.
Here, we consider microstructure classification and utilise the resulting features over a range of related machine learning tasks.
In particular, methods that leverage transfer learning with convolutional neural networks (CNNs), pretrained on the ImageNet dataset, are generally shown to outperform other methods.
arXiv Detail & Related papers (2022-03-25T15:40:44Z)
- Improving VAE based molecular representations for compound property prediction [0.0]
We propose a simple method to improve chemical property prediction performance of machine learning models.
We show a relation between the performance of property prediction models and the distance between the property prediction dataset and the larger unlabeled dataset.
arXiv Detail & Related papers (2022-01-13T12:57:11Z)
- Self-supervised Graph-level Representation Learning with Local and Global Structure [71.45196938842608]
We propose a unified framework called Local-instance and Global-semantic Learning (GraphLoG) for self-supervised whole-graph representation learning.
Besides preserving the local similarities, GraphLoG introduces the hierarchical prototypes to capture the global semantic clusters.
An efficient online expectation-maximization (EM) algorithm is further developed for learning the model.
arXiv Detail & Related papers (2021-06-08T05:25:38Z)
- Learning the Implicit Semantic Representation on Graph-Structured Data [57.670106959061634]
Existing representation learning methods in graph convolutional networks are mainly designed by describing the neighborhood of each node as a perceptual whole.
We propose Semantic Graph Convolutional Networks (SGCN), which explore implicit semantics by learning latent semantic paths in graphs.
arXiv Detail & Related papers (2021-01-16T16:18:43Z)
- Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel machine learning architecture that allows us to infuse a deep neural network with human-powered abstraction at the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)
- Machine learning with persistent homology and chemical word embeddings improves prediction accuracy and interpretability in metal-organic frameworks [0.07874708385247352]
We introduce an end-to-end machine learning model that automatically generates descriptors that capture a complex representation of a material's structure and chemistry.
It automatically encapsulates geometric and chemical information directly from the material system.
Our results show considerable improvement in both accuracy and transferability across targets compared to models constructed from commonly used, manually curated features.
arXiv Detail & Related papers (2020-10-01T16:31:46Z)
- A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)