Difficulty in chirality recognition for Transformer architectures
learning chemical structures from string
- URL: http://arxiv.org/abs/2303.11593v4
- Date: Sun, 14 Jan 2024 00:18:44 GMT
- Title: Difficulty in chirality recognition for Transformer architectures
learning chemical structures from string
- Authors: Yasuhiro Yoshikai, Tadahaya Mizuno, Shumpei Nemoto, Hiroyuki Kusuhara
- Abstract summary: We investigate the relationship between the learning progress of SMILES and chemical structure using a representative NLP model, the Transformer.
We show that while the Transformer learns partial structures of molecules quickly, it requires extended training to understand overall structures.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have seen rapid development of descriptor generation based on
representation learning of extremely diverse molecules, especially those that
apply natural language processing (NLP) models to SMILES, a literal
representation of molecular structure. However, little research has been done
on how these models understand chemical structure. To address this black box,
we investigated the relationship between the learning progress of SMILES and
chemical structure using a representative NLP model, the Transformer. We show
that while the Transformer learns partial structures of molecules quickly, it
requires extended training to understand overall structures. Consistently, the
accuracy of molecular property predictions using descriptors generated from
models at different learning steps was similar from the beginning to the end of
training. Furthermore, we found that the Transformer requires particularly long
training to learn chirality and sometimes stagnates with low performance due to
misunderstanding of enantiomers. These findings are expected to deepen the
understanding of NLP models in chemistry.
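To illustrate why enantiomers can be hard to tell apart from the string alone, the sketch below compares the SMILES of a molecule with that of its mirror image. In SMILES, tetrahedral chirality is written with the tags `@` and `@@`, so two enantiomers can differ in a single token. The tokenizer, the helper names (`tokenize`, `mirror`, `differing_positions`), and the alanine example are assumptions made for illustration; they are not taken from the paper, and the sketch only handles the plain `@`/`@@` tags.

```python
import re

# Simplified SMILES tokenizer (an illustrative assumption, not the paper's):
# a bracket atom is one token, as are Br and Cl; anything else is one character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_RE.findall(smiles)


def mirror(smiles):
    """Swap every '@' and '@@' tag, which inverts all tetrahedral centers."""
    return re.sub(r"@@?", lambda m: "@" if m.group() == "@@" else "@@", smiles)


def differing_positions(a, b):
    """Return the token indices at which two equally long SMILES differ."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    return [i for i, (x, y) in enumerate(zip(tokens_a, tokens_b)) if x != y]


alanine = "N[C@@H](C)C(=O)O"      # one enantiomer of alanine
alanine_mirror = mirror(alanine)  # the other enantiomer

print(alanine_mirror)                                # N[C@H](C)C(=O)O
print(len(tokenize(alanine)))                        # 11
print(differing_positions(alanine, alanine_mirror))  # [1]
```

Under this tokenization, only one of the eleven tokens differs between the two strings, while every other token is identical.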
Related papers
- How well can off-the-shelf LLMs elucidate molecular structures from mass spectra using chain-of-thought reasoning? [51.286853421822705]
Large language models (LLMs) have shown promise for reasoning-intensive scientific tasks, but their capability for chemical interpretation is still unclear.
We introduce a Chain-of-Thought (CoT) prompting framework and benchmark that evaluate how LLMs reason about mass spectral data to predict molecular structures.
Our evaluation across metrics of SMILES validity, formula consistency, and structural similarity reveals that while LLMs can produce syntactically valid and partially plausible structures, they fail to achieve chemical accuracy or link reasoning to correct molecular predictions.
arXiv Detail & Related papers (2026-01-09T20:08:42Z) - Circuits, Features, and Heuristics in Molecular Transformers [0.056179939237156]
We present a mechanistic analysis of autoregressive transformers trained on drug-like small molecules.
We identify computational patterns consistent with low-level syntactic parsing and more abstract chemical validity constraints.
arXiv Detail & Related papers (2025-12-10T15:35:22Z) - Mamba-driven multi-perspective structural understanding for molecular ground-state conformation prediction [69.32436472760712]
We propose an approach of Mamba-driven multi-perspective structural understanding (MPSU-Mamba) to localize molecular ground-state conformation.
For complex and diverse molecules, three different kinds of dedicated scanning strategies are explored to construct a comprehensive perception of corresponding molecular structures.
Experimental results on QM9 and Molecule3D datasets indicate that MPSU-Mamba significantly outperforms existing methods.
arXiv Detail & Related papers (2025-11-10T11:18:32Z) - Foundation Models for Discovery and Exploration in Chemical Space [57.97784111110166]
MIST is a family of molecular foundation models trained on large unlabeled datasets.
We demonstrate the ability of these models to solve real-world problems across chemical space.
arXiv Detail & Related papers (2025-10-20T17:56:01Z) - $\text{M}^{2}$LLM: Multi-view Molecular Representation Learning with Large Language Models [59.125833618091846]
We propose a multi-view framework that integrates three perspectives: the molecular structure view, the molecular task view, and the molecular rules view.
Experiments demonstrate that $\text{M}^{2}$LLM achieves state-of-the-art performance on multiple benchmarks across classification and regression tasks.
arXiv Detail & Related papers (2025-08-12T05:46:47Z) - DiffMS: Diffusion Generation of Molecules Conditioned on Mass Spectra [60.39311767532607]
DiffMS is a formula-restricted encoder-decoder generative network.
We develop a robust decoder that bridges latent embeddings and molecular structures.
Experiments show DiffMS outperforms existing models on de novo molecule generation.
arXiv Detail & Related papers (2025-02-13T18:29:48Z) - GraphXForm: Graph transformer for computer-aided molecular design with application to extraction [73.1842164721868]
We present GraphXForm, a decoder-only graph transformer architecture, which is pretrained on existing compounds and then fine-tuned.
We evaluate it on two solvent design tasks for liquid-liquid extraction, showing that it outperforms four state-of-the-art molecular design techniques.
arXiv Detail & Related papers (2024-11-03T19:45:15Z) - Pre-trained Molecular Language Models with Random Functional Group Masking [54.900360309677794]
We propose a SMILES-based Molecular Language Model, which randomly masks SMILES subsequences corresponding to specific molecular atoms.
This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities.
arXiv Detail & Related papers (2024-11-03T01:56:15Z) - MolTRES: Improving Chemical Language Representation Learning for Molecular Property Prediction [14.353313239109337]
MolTRES is a novel chemical language representation learning framework.
It incorporates generator-discriminator training, allowing the model to learn from more challenging examples.
Our model outperforms existing state-of-the-art models on popular molecular property prediction tasks.
arXiv Detail & Related papers (2024-07-09T01:14:28Z) - Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model [49.64512917330373]
We introduce a multi-constraint molecular generation large language model, TSMMG, akin to a student.
To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers'.
We experimentally show that TSMMG performs remarkably well in generating molecules that meet complex property requirements described in natural language.
arXiv Detail & Related papers (2024-03-20T02:15:55Z) - Empirical Evidence for the Fragment level Understanding on Drug
Molecular Structure of LLMs [16.508471997999496]
We investigate whether and how language models understand the chemical spatial structure from 1D sequences.
The results indicate that language models can understand chemical structures from the perspective of molecular fragments.
arXiv Detail & Related papers (2024-01-15T12:53:58Z) - Implicit Geometry and Interaction Embeddings Improve Few-Shot Molecular
Property Prediction [53.06671763877109]
We develop molecular embeddings that encode complex molecular characteristics to improve the performance of few-shot molecular property prediction.
Our approach leverages large amounts of synthetic data, namely the results of molecular docking calculations.
On multiple molecular property prediction benchmarks, training from the embedding space substantially improves Multi-Task, MAML, and Prototypical Network few-shot learning performance.
arXiv Detail & Related papers (2023-02-04T01:32:40Z) - MolCPT: Molecule Continuous Prompt Tuning to Generalize Molecular
Representation Learning [77.31492888819935]
We propose a novel paradigm of "pre-train, prompt, fine-tune" for molecular representation learning, named molecule continuous prompt tuning (MolCPT).
MolCPT defines a motif prompting function that uses the pre-trained model to project the standalone input into an expressive prompt.
Experiments on several benchmark datasets show that MolCPT efficiently generalizes pre-trained GNNs for molecular property prediction.
arXiv Detail & Related papers (2022-12-20T19:32:30Z) - Infusing Linguistic Knowledge of SMILES into Chemical Language Models [0.3655021726150368]
We grammatically parsed SMILES to obtain connectivity between substructures and their type, which is called the grammatical knowledge of SMILES.
Our representation model outperformed previous compound representations for the prediction of molecular properties.
arXiv Detail & Related papers (2022-04-20T01:25:18Z) - GeoT: A Geometry-aware Transformer for Reliable Molecular Property
Prediction and Chemically Interpretable Representation Learning [16.484048833163282]
We introduce a novel Transformer-based framework for molecular representation learning, named the Geometry-aware Transformer (GeoT).
GeoT learns molecular graph structures through attention-based mechanisms specifically designed to offer reliable interpretability, as well as molecular property prediction.
Our comprehensive experiments, including an empirical simulation, reveal that GeoT effectively learns the chemical insights into molecular structures, bridging the gap between artificial intelligence and molecular sciences.
arXiv Detail & Related papers (2021-06-29T15:47:18Z) - Do Large Scale Molecular Language Representations Capture Important
Structural Information? [31.76876206167457]
We present molecular embeddings obtained by training an efficient transformer encoder model, referred to as MoLFormer.
Experiments show that the learned molecular representation performs competitively, when compared to graph-based and fingerprint-based supervised learning baselines.
arXiv Detail & Related papers (2021-06-17T14:33:55Z) - Learning Latent Space Energy-Based Prior Model for Molecule Generation [59.875533935578375]
We learn a latent space energy-based prior model with a SMILES representation for molecule modeling.
Our method is able to generate molecules with validity and uniqueness competitive with state-of-the-art models.
arXiv Detail & Related papers (2020-10-19T09:34:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.