Related papers: Tokenizing 3D Molecule Structure with Quantized Spherical Coordinates

URL: http://arxiv.org/abs/2412.01564v1
Date: Mon, 02 Dec 2024 14:50:44 GMT
Title: Tokenizing 3D Molecule Structure with Quantized Spherical Coordinates
Authors: Kaiyuan Gao, Yusong Wang, Haoxiang Guan, Zun Wang, Qizhi Pei, John E. Hopcroft, Kun He, Lijun Wu
Abstract summary: Mol-StrucTok is a novel method for tokenizing 3D molecular structures. We design a line notation for 3D molecules by extracting local atomic coordinates in a spherical coordinate system. We employ a Vector Quantized Variational Autoencoder (VQ-VAE) to tokenize these coordinates, treating them as generation descriptors.
Score: 28.452581855002855
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The application of language models (LMs) to molecular structure generation using line notations such as SMILES and SELFIES has been well-established in the field of cheminformatics. However, extending these models to generate 3D molecular structures presents significant challenges. Two primary obstacles emerge: (1) the difficulty in designing a 3D line notation that ensures SE(3)-invariant atomic coordinates, and (2) the non-trivial task of tokenizing continuous coordinates for use in LMs, which inherently require discrete inputs. To address these challenges, we propose Mol-StrucTok, a novel method for tokenizing 3D molecular structures. Our approach comprises two key innovations: (1) We design a line notation for 3D molecules by extracting local atomic coordinates in a spherical coordinate system. This notation builds upon existing 2D line notations and remains agnostic to their specific forms, ensuring compatibility with various molecular representation schemes. (2) We employ a Vector Quantized Variational Autoencoder (VQ-VAE) to tokenize these coordinates, treating them as generation descriptors. To further enhance the representation, we incorporate neighborhood bond lengths and bond angles as understanding descriptors. Leveraging this tokenization framework, we train a GPT-2 style model for 3D molecular generation tasks. Results demonstrate strong performance with significantly faster generation speeds and competitive chemical stability compared to previous methods. Further, by integrating our learned discrete representations into the Graphormer model for property prediction on the QM9 dataset, Mol-StrucTok reveals consistent improvements across various molecular properties, underscoring the versatility and robustness of our approach.
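
The two ideas in the abstract (local spherical coordinates, then discretization against a codebook) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the atom's offset is already expressed in some local reference frame, and it replaces the learned VQ-VAE encoder and codebook with a tiny hand-written codebook, keeping only the nearest-neighbor lookup that vector quantization relies on. The names `to_spherical`, `nearest_code`, and `codebook` are hypothetical.

```python
import math

def to_spherical(x, y, z):
    """Convert a Cartesian offset to (r, theta, phi):
    radius, polar angle in [0, pi], azimuth in [-pi, pi]."""
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        return 0.0, 0.0, 0.0  # angles are undefined at the origin; use zeros
    theta = math.acos(max(-1.0, min(1.0, z / r)))  # clamp against rounding error
    phi = math.atan2(y, x)
    return r, theta, phi

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec
    (squared Euclidean distance), i.e. the discrete token id."""
    def dist2(i):
        return sum((a - b) ** 2 for a, b in zip(vec, codebook[i]))
    return min(range(len(codebook)), key=dist2)

# Hypothetical 3-entry codebook over (r, theta, phi); a real VQ-VAE learns
# its codebook and quantizes encoder outputs rather than raw coordinates.
codebook = [
    (1.0, 0.0, 0.0),
    (1.0, math.pi / 2, 0.0),
    (1.5, math.pi / 2, math.pi),
]

# Assumed: offset of a neighboring atom, already in a local frame.
token = nearest_code(to_spherical(0.98, 0.05, 0.0), codebook)
print(token)
```

The plain Euclidean distance used here ignores the periodicity of the azimuth, which is one reason a learned encoder is a better fit than quantizing raw angles directly.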

Related papers

Aligned Manifold Property and Topology Point Clouds for Learning Molecular Properties [55.2480439325792]
This work introduces AMPTCR, a molecular surface representation that combines local quantum-derived scalar fields and custom topological descriptors within an aligned point cloud format. For molecular weight, results confirm that AMPTCR encodes physically meaningful data, with a validation R² of 0.87. In the bacterial inhibition task, AMPTCR enables both classification and direct regression of E. coli inhibition values.
arXiv Detail & Related papers (2025-07-22T04:35:50Z)
Sampling 3D Molecular Conformers with Diffusion Transformers [13.536503487456622]
Diffusion Transformers (DiTs) have demonstrated strong performance in generative modeling. Applying DiTs to molecules introduces novel challenges, such as integrating discrete molecular graph information with continuous 3D geometry. We propose DiTMC, a framework that adapts DiTs to address these challenges through a modular architecture.
arXiv Detail & Related papers (2025-06-18T11:47:59Z)
Towards Unified Latent Space for 3D Molecular Latent Diffusion Modeling [80.59215359958934]
3D molecule generation is crucial for drug discovery and material science. Existing approaches typically maintain separate latent spaces for invariant and equivariant modalities. We propose a multi-modal VAE that compresses 3D molecules into latent sequences from a unified latent space.
arXiv Detail & Related papers (2025-03-19T08:56:13Z)
DPLM-2: A Multimodal Diffusion Protein Language Model [75.98083311705182]
We introduce DPLM-2, a multimodal protein foundation model that extends discrete diffusion protein language model (DPLM) to accommodate both sequences and structures. DPLM-2 learns the joint distribution of sequence and structure, as well as their marginals and conditionals. Empirical evaluation shows that DPLM-2 can simultaneously generate highly compatible amino acid sequences and their corresponding 3D structures.
arXiv Detail & Related papers (2024-10-17T17:20:24Z)
MolMix: A Simple Yet Effective Baseline for Multimodal Molecular Representation Learning [17.93173928602627]
We propose a simple transformer-based baseline for multimodal molecular representation learning. We integrate three distinct modalities: SMILES strings, 2D graph representations, and 3D conformers of molecules. Despite its simplicity, our approach achieves state-of-the-art results across multiple datasets.
arXiv Detail & Related papers (2024-10-10T14:36:58Z)
Geometry Informed Tokenization of Molecules for Language Model Generation [85.80491667588923]
We consider molecule generation in 3D space using language models (LMs). Although tokenization of molecular graphs exists, that for 3D geometries is largely unexplored. We propose Geo2Seq, which converts molecular geometries into $SE(3)$-invariant 1D discrete sequences.
arXiv Detail & Related papers (2024-08-19T16:09:59Z)
3D-MolT5: Leveraging Discrete Structural Information for Molecule-Text Modeling [41.07090635630771]
We propose 3D-MolT5, a unified framework designed to model molecules in both sequence and 3D structure spaces. The key innovation of our approach lies in mapping fine-grained 3D substructure representations into a specialized 3D token vocabulary. Our approach significantly improves cross-modal interaction and alignment, addressing key challenges in previous work.
arXiv Detail & Related papers (2024-06-09T14:20:55Z)
Geometry-aware Line Graph Transformer Pre-training for Molecular Property Prediction [4.598522704308923]
Geometry-aware line graph transformer (Galformer) pre-training is a novel self-supervised learning framework. Galformer consistently outperforms all baselines on both classification and regression tasks.
arXiv Detail & Related papers (2023-09-01T14:20:48Z)
CoarsenConf: Equivariant Coarsening with Aggregated Attention for Molecular Conformer Generation [3.31521245002301]
We introduce CoarsenConf, which integrates molecular graphs based on torsional angles into an SE(3)-equivariant hierarchical variational autoencoder. Through equivariant coarse-graining, we aggregate the fine-grained atomic coordinates of subgraphs connected via rotatable bonds, creating a variable-length coarse-grained latent representation. Our model uses a novel aggregated attention mechanism to restore fine-grained coordinates from the coarse-grained latent representation, enabling efficient generation of accurate conformers.
arXiv Detail & Related papers (2023-06-26T17:02:54Z)
Generation of 3D Molecules in Pockets via Language Model [0.0]
Generative models for molecules based on sequential line notation (e.g. SMILES) or graph representation have attracted increasing interest in the field of structure-based drug design. We introduce Lingo3DMol, a pocket-based 3D molecule generation method that combines language models and geometric deep learning technology.
arXiv Detail & Related papers (2023-05-17T11:31:06Z)
NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One Go [109.88509362837475]
We present NeuroMorph, a new neural network architecture that takes as input two 3D shapes. NeuroMorph produces smooth and point-to-point correspondences between them. It works well for a large variety of input shapes, including non-isometric pairs from different object categories.
arXiv Detail & Related papers (2021-06-17T12:25:44Z)
GeoMol: Torsional Geometric Generation of Molecular 3D Conformer Ensembles [60.12186997181117]
Prediction of a molecule's 3D conformer ensemble from the molecular graph plays a key role in cheminformatics and drug discovery. Existing generative models have several drawbacks, including a failure to model important elements of molecular geometry. We propose GeoMol, an end-to-end, non-autoregressive and SE(3)-invariant machine learning approach to generate 3D conformers.
arXiv Detail & Related papers (2021-06-08T14:17:59Z)
An End-to-End Framework for Molecular Conformation Generation via Bilevel Programming [71.82571553927619]
We propose an end-to-end solution for molecular conformation prediction called ConfVAE. Specifically, the molecular graph is first encoded in a latent space, and then the 3D structures are generated by solving a principled bilevel optimization program.
arXiv Detail & Related papers (2021-05-15T15:22:29Z)
Dense Non-Rigid Structure from Motion: A Manifold Viewpoint [162.88686222340962]
The Non-Rigid Structure-from-Motion (NRSfM) problem aims to recover the 3D geometry of a deforming object from its 2D feature correspondences across multiple frames. We show that our approach significantly improves accuracy, scalability, and robustness against noise.
arXiv Detail & Related papers (2020-06-15T09:15:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.