Tokenizing 3D Molecule Structure with Quantized Spherical Coordinates
- URL: http://arxiv.org/abs/2412.01564v1
- Date: Mon, 02 Dec 2024 14:50:44 GMT
- Title: Tokenizing 3D Molecule Structure with Quantized Spherical Coordinates
- Authors: Kaiyuan Gao, Yusong Wang, Haoxiang Guan, Zun Wang, Qizhi Pei, John E. Hopcroft, Kun He, Lijun Wu,
- Abstract summary: Mol-StrucTok is a novel method for tokenizing 3D molecular structures.
We design a line notation for 3D molecules by extracting local atomic coordinates in a spherical coordinate system.
We employ a Vector Quantized Variational Autoencoder (VQ-VAE) to tokenize these coordinates, treating them as generation descriptors.
- Score: 28.452581855002855
- License:
- Abstract: The application of language models (LMs) to molecular structure generation using line notations such as SMILES and SELFIES has been well-established in the field of cheminformatics. However, extending these models to generate 3D molecular structures presents significant challenges. Two primary obstacles emerge: (1) the difficulty in designing a 3D line notation that ensures SE(3)-invariant atomic coordinates, and (2) the non-trivial task of tokenizing continuous coordinates for use in LMs, which inherently require discrete inputs. To address these challenges, we propose Mol-StrucTok, a novel method for tokenizing 3D molecular structures. Our approach comprises two key innovations: (1) We design a line notation for 3D molecules by extracting local atomic coordinates in a spherical coordinate system. This notation builds upon existing 2D line notations and remains agnostic to their specific forms, ensuring compatibility with various molecular representation schemes. (2) We employ a Vector Quantized Variational Autoencoder (VQ-VAE) to tokenize these coordinates, treating them as generation descriptors. To further enhance the representation, we incorporate neighborhood bond lengths and bond angles as understanding descriptors. Leveraging this tokenization framework, we train a GPT-2 style model for 3D molecular generation tasks. Results demonstrate strong performance with significantly faster generation speeds and competitive chemical stability compared to previous methods. Further, by integrating our learned discrete representations into Graphormer model for property prediction on QM9 dataset, Mol-StrucTok reveals consistent improvements across various molecular properties, underscoring the versatility and robustness of our approach.
Related papers
- DPLM-2: A Multimodal Diffusion Protein Language Model [75.98083311705182]
We introduce DPLM-2, a multimodal protein foundation model that extends discrete diffusion protein language model (DPLM) to accommodate both sequences and structures.
DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals.
Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures.
arXiv Detail & Related papers (2024-10-17T17:20:24Z) - MolMix: A Simple Yet Effective Baseline for Multimodal Molecular Representation Learning [17.93173928602627]
We propose a simple transformer-based baseline for multimodal molecular representation learning.
We integrate three distinct modalities: SMILES strings, 2D graph representations, and 3D conformers of molecules.
Despite its simplicity, our approach achieves state-of-the-art results across multiple datasets.
arXiv Detail & Related papers (2024-10-10T14:36:58Z) - Geometry Informed Tokenization of Molecules for Language Model Generation [85.80491667588923]
We consider molecule generation in 3D space using language models (LMs)
Although tokenization of molecular graphs exists, that for 3D geometries is largely unexplored.
We propose the Geo2Seq, which converts molecular geometries into $SE(3)$-invariant 1D discrete sequences.
arXiv Detail & Related papers (2024-08-19T16:09:59Z) - Structure-Aware E(3)-Invariant Molecular Conformer Aggregation Networks [43.80038907470173]
A molecule's 2D representation consists of its atoms, their attributes, and the molecule's covalent bonds.
A 3D representation of a molecule is called a conformer and consists of its atom types and Cartesian coordinates.
arXiv Detail & Related papers (2024-02-03T00:58:41Z) - Geometry-aware Line Graph Transformer Pre-training for Molecular
Property Prediction [4.598522704308923]
Geometry-aware line graph transformer (Galformer) pre-training is a novel self-supervised learning framework.
Galformer consistently outperforms all baselines on both classification and regression tasks.
arXiv Detail & Related papers (2023-09-01T14:20:48Z) - CoarsenConf: Equivariant Coarsening with Aggregated Attention for
Molecular Conformer Generation [3.31521245002301]
We introduce CoarsenConf, which integrates molecular graphs based on torsional angles into an SE(3)-equivariant hierarchical variational autoencoder.
Through equivariant coarse-graining, we aggregate the fine-grained atomic coordinates of subgraphs connected via rotatable bonds, creating a variable-length coarse-grained latent representation.
Our model uses a novel aggregated attention mechanism to restore fine-grained coordinates from the coarse-grained latent representation, enabling efficient generation of accurate conformers.
arXiv Detail & Related papers (2023-06-26T17:02:54Z) - NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One
Go [109.88509362837475]
We present NeuroMorph, a new neural network architecture that takes as input two 3D shapes.
NeuroMorph produces smooth and point-to-point correspondences between them.
It works well for a large variety of input shapes, including non-isometric pairs from different object categories.
arXiv Detail & Related papers (2021-06-17T12:25:44Z) - GeoMol: Torsional Geometric Generation of Molecular 3D Conformer
Ensembles [60.12186997181117]
Prediction of a molecule's 3D conformer ensemble from the molecular graph holds a key role in areas of cheminformatics and drug discovery.
Existing generative models have several drawbacks including lack of modeling important molecular geometry elements.
We propose GeoMol, an end-to-end, non-autoregressive and SE(3)-invariant machine learning approach to generate 3D conformers.
arXiv Detail & Related papers (2021-06-08T14:17:59Z) - An End-to-End Framework for Molecular Conformation Generation via
Bilevel Programming [71.82571553927619]
We propose an end-to-end solution for molecular conformation prediction called ConfVAE.
Specifically, the molecular graph is first encoded in a latent space, and then the 3D structures are generated by solving a principled bilevel optimization program.
arXiv Detail & Related papers (2021-05-15T15:22:29Z) - Dense Non-Rigid Structure from Motion: A Manifold Viewpoint [162.88686222340962]
Non-Rigid Structure-from-Motion (NRSfM) problem aims to recover 3D geometry of a deforming object from its 2D feature correspondences across multiple frames.
We show that our approach significantly improves accuracy, scalability, and robustness against noise.
arXiv Detail & Related papers (2020-06-15T09:15:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.