Contrastive Learning of English Language and Crystal Graphs for Multimodal Representation of Materials Knowledge
- URL: http://arxiv.org/abs/2502.16451v1
- Date: Sun, 23 Feb 2025 05:39:46 GMT
- Title: Contrastive Learning of English Language and Crystal Graphs for Multimodal Representation of Materials Knowledge
- Authors: Yang Jeong Park, Mayank Kumaran, Chia-Wei Hsu, Elsa Olivetti, Ju Li
- Abstract summary: We introduce a contrastive language-crystals model (CLaC) pre-trained on a newly synthesized dataset of 126k crystal structure-text pairs. CLaC achieves state-of-the-art zero-shot generalization performance in understanding crystal structures.
- Score: 0.15978270011184253
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Artificial intelligence (AI) is increasingly used for the inverse design of materials, such as crystals and molecules. Existing AI research on molecules has integrated chemical structures of molecules with textual knowledge to adapt to complex instructions. However, this approach has been unattainable for crystals due to data scarcity from the biased distribution of investigated crystals and the lack of semantic supervision in peer-reviewed literature. In this work, we introduce a contrastive language-crystals model (CLaC) pre-trained on a newly synthesized dataset of 126k crystal structure-text pairs. To demonstrate the advantage of using synthetic data to overcome data scarcity, we constructed a comparable dataset extracted from academic papers. We evaluate CLaC's generalization ability through various zero-shot cross-modal tasks and downstream applications. In experiments, CLaC achieves state-of-the-art zero-shot generalization performance in understanding crystal structures, surpassing the latest large language models.
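As a rough illustration of what contrastive pre-training on crystal structure-text pairs can look like, the sketch below implements a CLIP-style symmetric InfoNCE loss and a nearest-neighbor zero-shot retrieval step in NumPy. This is an assumption for illustration only: the abstract does not state CLaC's exact objective, encoders, or temperature, and the names `symmetric_contrastive_loss`, `zero_shot_retrieve`, and the value `temperature=0.07` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_contrastive_loss(crystal_emb, text_emb, temperature=0.07):
    """CLIP-style loss (assumed, not confirmed as CLaC's objective).

    Row i of crystal_emb and row i of text_emb form a positive pair;
    every other row in the batch serves as a negative.
    """
    c = l2_normalize(crystal_emb)
    t = l2_normalize(text_emb)
    logits = c @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    loss_c2t = -log_softmax(logits)[idx, idx].mean()    # crystal -> text
    loss_t2c = -log_softmax(logits.T)[idx, idx].mean()  # text -> crystal
    return 0.5 * (loss_c2t + loss_t2c)


def zero_shot_retrieve(crystal_emb, text_emb):
    """For each crystal, return the index of the most similar text embedding."""
    return (l2_normalize(crystal_emb) @ l2_normalize(text_emb).T).argmax(axis=1)


if __name__ == "__main__":
    # Toy embeddings standing in for encoder outputs (hypothetical data).
    emb = np.eye(4)
    print(symmetric_contrastive_loss(emb, emb))
    print(zero_shot_retrieve(emb, np.roll(emb, 1, axis=0)))
```

In a real pipeline the two embedding matrices would come from a crystal graph encoder and a text encoder trained jointly; here they are plain arrays so the loss and retrieval logic can be checked in isolation.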
Related papers
- Causal Discovery from Data Assisted by Large Language Models [50.193740129296245]
It is essential to integrate experimental data with prior domain knowledge for knowledge-driven discovery.
Here we demonstrate this approach by combining high-resolution scanning transmission electron microscopy (STEM) data with insights derived from large language models (LLMs).
By fine-tuning ChatGPT on domain-specific literature, we construct adjacency matrices for Directed Acyclic Graphs (DAGs) that map the causal relationships between structural, chemical, and polarization degrees of freedom in Sm-doped BiFeO3 (SmBFO).
arXiv Detail & Related papers (2025-03-18T02:14:49Z) - Contrastive Language-Structure Pre-training Driven by Materials Science Literature [10.170537065646323]
Contrastive Language-Structure Pre-training (CLaSP) is a learning paradigm for constructing crossmodal embedding spaces between crystal structures and texts. CLaSP aims to achieve material embeddings that capture property- and functionality-related similarities between crystal structures. We demonstrate the effectiveness of CLaSP through text-based crystal structure screening and embedding space visualization.
arXiv Detail & Related papers (2025-01-22T14:47:59Z) - Generative Hierarchical Materials Search [91.93125016916463]
We propose Generative Hierarchical Materials Search (GenMS) for controllable generation of crystal structures.
GenMS consists of (1) a language model that takes high-level natural language as input and generates intermediate textual information about a crystal.
GenMS additionally uses a graph neural network to predict properties (e.g., formation energy) from the generated crystal structures.
arXiv Detail & Related papers (2024-09-10T17:51:28Z) - Generative Inverse Design of Crystal Structures via Diffusion Models with Transformers [1.2289361708127877]
Discovery of new inorganic materials with promising properties poses a critical challenge, both scientifically and for industrial applications.
In this study, we explore a new type of diffusion model for the generative inverse design of crystal structures, with a backbone based on a Transformer architecture.
arXiv Detail & Related papers (2024-06-13T16:03:15Z) - Text2Data: Low-Resource Data Generation with Textual Control [100.5970757736845]
Text2Data is a novel approach that utilizes unlabeled data to understand the underlying data distribution. It undergoes finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting.
arXiv Detail & Related papers (2024-02-08T03:41:39Z) - Compositional Representation of Polymorphic Crystalline Materials [56.80318252233511]
We introduce PCRL, a novel approach that employs probabilistic modeling of composition to capture the diverse polymorphs from available structural information. Extensive evaluations on sixteen datasets demonstrate the effectiveness of PCRL in learning compositional representation.
arXiv Detail & Related papers (2023-11-17T20:34:28Z) - Scalable Diffusion for Materials Generation [99.71001883652211]
We develop a unified crystal representation that can represent any crystal structure (UniMat).
UniMat can generate high fidelity crystal structures from larger and more complex chemical systems.
We propose additional metrics for evaluating generative models of materials.
arXiv Detail & Related papers (2023-10-18T15:49:39Z) - CrysMMNet: Multimodal Representation for Crystal Property Prediction [22.576167897068956]
We propose CrysMMNet, a simple multi-modal framework, which fuses both structural and textual representation together to generate a joint multimodal representation of crystalline materials.
We conduct extensive experiments on two benchmark datasets across ten different properties to show that CrysMMNet outperforms existing state-of-the-art baseline methods by a good margin.
arXiv Detail & Related papers (2023-06-09T11:16:01Z) - A data-driven interpretation of the stability of molecular crystals [0.0]
Predicting the stability of crystal structures formed from molecular building blocks is a non-trivial scientific problem.
We introduce a structural descriptor tailored to the prediction of the binding energy for a curated dataset of organic crystals.
We then interpret this library using a low-dimensional representation of the structure-energy landscape.
arXiv Detail & Related papers (2022-09-21T23:32:53Z) - A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques.
We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.