Universal Semantic Embeddings of Chemical Elements for Enhanced Materials Inference and Discovery
- URL: http://arxiv.org/abs/2502.14912v1
- Date: Wed, 19 Feb 2025 07:26:03 GMT
- Title: Universal Semantic Embeddings of Chemical Elements for Enhanced Materials Inference and Discovery
- Authors: Yunze Jia, Yuehui Xian, Yangyang Xu, Pengfei Dang, Xiangdong Ding, Jun Sun, Yumei Zhou, Dezhen Xue
- Abstract summary: We present a framework for generating universal semantic embeddings of chemical elements to advance materials inference and discovery. This framework leverages ElementBERT, a domain-specific BERT-based natural language processing model trained on 1.29 million abstracts of alloy-related scientific papers.
- Score: 10.842037420887468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a framework for generating universal semantic embeddings of chemical elements to advance materials inference and discovery. This framework leverages ElementBERT, a domain-specific BERT-based natural language processing model trained on 1.29 million abstracts of alloy-related scientific papers, to capture latent knowledge and contextual relationships specific to alloys. These semantic embeddings serve as robust elemental descriptors, consistently outperforming traditional empirical descriptors with significant improvements across multiple downstream tasks. These include predicting mechanical and transformation properties, classifying phase structures, and optimizing materials properties via Bayesian optimization. Applications to titanium alloys, high-entropy alloys, and shape memory alloys demonstrate up to 23% gains in prediction accuracy. Our results show that ElementBERT surpasses general-purpose BERT variants by encoding specialized alloy knowledge. By bridging contextual insights from scientific literature with quantitative inference, our framework accelerates the discovery and optimization of advanced materials, with potential applications extending beyond alloys to other material classes.
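The pipeline the abstract describes — pooling a language model's contextual embeddings of an element into a fixed-length descriptor for downstream models — can be sketched as follows. This is a minimal illustration with random stand-in vectors, not ElementBERT's actual weights or pooling scheme; the names and shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def pooled_element_embedding(mention_embeddings: np.ndarray) -> np.ndarray:
    """Average the contextual embeddings of every mention of an element
    across the corpus into one fixed-length semantic descriptor."""
    return mention_embeddings.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compare two elemental descriptors by cosine similarity."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins: pretend Ti appeared in 120 abstract contexts and Nb in 80,
# each context yielding a 768-dimensional contextual embedding.
ti = pooled_element_embedding(rng.normal(size=(120, 768)))
nb = pooled_element_embedding(rng.normal(size=(80, 768)))

# The pooled vectors would then replace empirical descriptors as inputs
# to property regressors, phase classifiers, or Bayesian optimization.
sim = cosine_similarity(ti, nb)
print(round(sim, 4))
```

In the real framework the per-mention vectors would come from the trained ElementBERT encoder rather than a random generator; only the pooling-and-compare step is shown here.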
Related papers
- Graft: Integrating the Domain Knowledge via Efficient Parameter Synergy for MLLMs [56.76586846269894]
Multimodal Large Language Models (MLLMs) have achieved success across various domains. Despite its importance, the study of knowledge sharing among domain-specific MLLMs remains largely underexplored. We propose a unified parameter integration framework that enables modular composition of expert capabilities.
arXiv Detail & Related papers (2025-06-30T15:07:41Z) - KEPLA: A Knowledge-Enhanced Deep Learning Framework for Accurate Protein-Ligand Binding Affinity Prediction [60.23701115249195]
KEPLA is a novel deep learning framework that integrates prior knowledge from Gene Ontology and ligand properties to enhance prediction performance. Experiments on two benchmark datasets demonstrate that KEPLA consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-06-16T08:02:42Z) - Information fusion strategy integrating pre-trained language model and contrastive learning for materials knowledge mining [0.4128284355136163]
Machine learning has revolutionized materials design, yet predicting complex properties like alloy ductility remains challenging. Here, we present an innovative information fusion architecture that integrates domain-specific texts from materials science literature with quantitative physical descriptors to overcome these limitations. Our framework employs MatSciBERT for advanced textual comprehension and incorporates contrastive learning to automatically extract implicit knowledge regarding processing parameters and microstructural characteristics.
arXiv Detail & Related papers (2025-06-14T14:09:00Z) - Learning Item Representations Directly from Multimodal Features for Effective Recommendation [51.49251689107541]
Multimodal recommender systems predominantly leverage Bayesian Personalized Ranking (BPR) optimization to learn item representations. We propose a novel model (i.e., LIRDRec) that learns item representations directly from multimodal features to augment recommendation performance.
arXiv Detail & Related papers (2025-05-08T05:42:22Z) - Causal Discovery from Data Assisted by Large Language Models [50.193740129296245]
It is essential to integrate experimental data with prior domain knowledge for knowledge driven discovery.
Here we demonstrate this approach by combining high-resolution scanning transmission electron microscopy (STEM) data with insights derived from large language models (LLMs).
By fine-tuning ChatGPT on domain-specific literature, we construct adjacency matrices for Directed Acyclic Graphs (DAGs) that map the causal relationships between structural, chemical, and polarization degrees of freedom in Sm-doped BiFeO3 (SmBFO).
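Assembling LLM-elicited causal links into a DAG adjacency matrix can be sketched as below. The variable names and edges are hypothetical placeholders, not the paper's actual findings; the acyclicity check uses the standard fact that a directed graph is acyclic iff every power of its adjacency matrix has zero trace.

```python
import numpy as np

# Hypothetical degrees of freedom and LLM-asserted cause -> effect edges.
variables = ["structure", "chemistry", "polarization"]
edges = [("chemistry", "structure"),
         ("structure", "polarization"),
         ("chemistry", "polarization")]

# Build the adjacency matrix: A[i, j] = 1 means variable i causes j.
idx = {v: i for i, v in enumerate(variables)}
A = np.zeros((len(variables), len(variables)), dtype=int)
for cause, effect in edges:
    A[idx[cause], idx[effect]] = 1

def is_dag(adj: np.ndarray) -> bool:
    """A directed graph on n nodes is acyclic iff tr(adj^k) = 0 for
    k = 1..n (a nonzero trace of some power reveals a cycle)."""
    n = len(adj)
    m = adj.copy()
    for _ in range(n):
        if np.trace(m) != 0:
            return False
        m = m @ adj
    return True

print(is_dag(A))
```

In the described approach, the edge list would come from querying the fine-tuned model about pairwise relations; the consistency check above is one way to validate that the elicited structure is a valid DAG.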
arXiv Detail & Related papers (2025-03-18T02:14:49Z) - Inverse Materials Design by Large Language Model-Assisted Generative Framework [35.04390544440238]
AlloyGAN is a framework that integrates Large Language Model (LLM)-assisted text mining with Conditional Generative Adversarial Networks (CGANs).
For metallic glasses, the framework predicts thermodynamic properties with discrepancies of less than 8% from experiments.
By bridging generative AI with domain knowledge, AlloyGAN offers a scalable approach to accelerate the discovery of materials with tailored properties.
arXiv Detail & Related papers (2025-02-25T11:52:59Z) - DARWIN 1.5: Large Language Models as Materials Science Adapted Learners [46.7259033847682]
We propose DARWIN 1.5, the largest open-source large language model tailored for materials science. DARWIN eliminates the need for task-specific descriptors and enables a flexible, unified approach to material property prediction and discovery. Our approach integrates 6M material domain papers and 21 experimental datasets from 49,256 materials across modalities while enabling cross-task knowledge transfer.
arXiv Detail & Related papers (2024-12-16T16:51:27Z) - Material Property Prediction with Element Attribute Knowledge Graphs and Multimodal Representation Learning [8.523289773617503]
We build an element property knowledge graph and utilize an embedding model to encode the element attributes within the knowledge graph.
A multimodal fusion framework, ESNet, integrates element property features with crystal structure features to generate joint multimodal representations.
This provides a more comprehensive perspective for predicting the performance of crystalline materials.
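The fusion step described above — combining element-attribute embeddings from a knowledge graph with crystal-structure features into a joint representation — can be sketched minimally. The projection dimensions and random weights below are assumptions for illustration, not ESNet's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in inputs: a knowledge-graph element-attribute embedding and a
# crystal-structure descriptor for one material.
elem_feat = rng.normal(size=(16,))
struct_feat = rng.normal(size=(32,))

# In the real model these projections would be learned; random here.
W_elem = rng.normal(size=(8, 16))
W_struct = rng.normal(size=(8, 32))

# Project each modality to a shared width, then concatenate into the
# joint multimodal representation fed to the property predictor.
joint = np.concatenate([W_elem @ elem_feat, W_struct @ struct_feat])
print(joint.shape)
```

Concatenation after per-modality projection is only one simple fusion choice; attention-based or gated fusion would slot into the same interface.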
arXiv Detail & Related papers (2024-11-13T08:07:21Z) - From Tokens to Materials: Leveraging Language Models for Scientific Discovery [12.211984932142537]
This study investigates the application of language model embeddings to enhance material property prediction in materials science.
We demonstrate that domain-specific models, particularly MatBERT, significantly outperform general-purpose models in extracting implicit knowledge from compound names and material properties.
arXiv Detail & Related papers (2024-10-21T16:31:23Z) - FecTek: Enhancing Term Weight in Lexicon-Based Retrieval with Feature Context and Term-level Knowledge [54.61068946420894]
We introduce an innovative method built on FEature Context and TErm-level Knowledge modules.
To effectively enrich the feature context representations of term weight, the Feature Context Module (FCM) is introduced.
We also develop a term-level knowledge guidance module (TKGM) for effectively utilizing term-level knowledge to intelligently guide the modeling process of term weight.
arXiv Detail & Related papers (2024-04-18T12:58:36Z) - AlloyBERT: Alloy Property Prediction with Large Language Models [5.812284760539713]
This study introduces AlloyBERT, a transformer encoder-based model designed to predict alloy properties using textual inputs.
By combining a tokenizer trained on our textual data and a RoBERTa encoder pre-trained and fine-tuned for this specific task, we achieved a mean squared error (MSE) of 0.00015 on the Multi Principal Elemental Alloys (MPEA) dataset and 0.00611 on the Refractory Alloy Yield Strength (RAYS) dataset.
Our results highlight the potential of language models in material science and establish a foundational framework for text-based prediction of alloy properties.
arXiv Detail & Related papers (2024-03-28T19:09:46Z) - FKA-Owl: Advancing Multimodal Fake News Detection through Knowledge-Augmented LVLMs [48.32113486904612]
We propose FKA-Owl, a framework that leverages forgery-specific knowledge to augment Large Vision-Language Models (LVLMs).
Experiments on the public benchmark demonstrate that FKA-Owl achieves superior cross-domain performance compared to previous methods.
arXiv Detail & Related papers (2024-03-04T12:35:09Z) - Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z) - Leveraging Language Representation for Material Recommendation, Ranking, and Exploration [0.0]
We introduce a material discovery framework that uses natural language embeddings derived from language models as representations of compositional and structural features.
By applying the framework to thermoelectrics, we demonstrate diversified recommendations of prototype structures and identify under-studied high-performance material spaces.
arXiv Detail & Related papers (2023-05-01T21:58:29Z) - UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus [73.86656026386038]
We introduce UmlsBERT, a contextual embedding model that integrates domain knowledge during the pre-training process.
By applying these two strategies, UmlsBERT can encode clinical domain knowledge into word embeddings and outperform existing domain-specific models.
arXiv Detail & Related papers (2020-10-20T15:56:31Z) - Improving Compositional Generalization in Semantic Parsing [54.4720965813889]
Generalization of models to out-of-distribution (OOD) data has captured tremendous attention recently.
We investigate compositional generalization in semantic parsing, a natural test-bed for compositional generalization.
arXiv Detail & Related papers (2020-10-12T12:34:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.