Related papers: Bio2Token: All-atom tokenization of any biomolecular structure with Mamba

URL: http://arxiv.org/abs/2410.19110v1
Date: Thu, 24 Oct 2024 19:23:09 GMT
Title: Bio2Token: All-atom tokenization of any biomolecular structure with Mamba
Authors: Andrew Liu, Axel Elaldi, Nathan Russell, Olivia Viessmann
Abstract summary: We develop quantized auto-encoders that learn atom-level tokenizations of complete proteins, RNA and small molecule structures with reconstruction accuracies below and around 1 Angstrom. We demonstrate that the Mamba state space model architecture employed is comparatively efficient, requiring a fraction of the training data, parameters and compute needed to reach competitive accuracies and can scale to systems with almost 100,000 atoms.
Score: 3.039173168183899
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Efficient encoding and representation of large 3D molecular structures with high fidelity is critical for biomolecular design applications. Despite this, many representation learning approaches restrict themselves to modeling smaller systems or use coarse-grained approximations of the systems, for example modeling proteins at the resolution of amino acid residues rather than at the level of individual atoms. To address this, we develop quantized auto-encoders that learn atom-level tokenizations of complete proteins, RNA and small molecule structures with reconstruction accuracies below and around 1 Angstrom. We demonstrate that the Mamba state space model architecture employed is comparatively efficient, requiring a fraction of the training data, parameters and compute needed to reach competitive accuracies and can scale to systems with almost 100,000 atoms. The learned structure tokens of bio2token may serve as the input for all-atom language models in the future.

Related papers

DualEquiNet: A Dual-Space Hierarchical Equivariant Network for Large Biomolecules [32.33126287600196]
We introduce DualEquiNet, a Dual-Space Hierarchical Equivariant Network that constructs complementary representations in both Euclidean and Spherical Harmonics spaces to capture local geometry and global symmetry-aware features. DualEquiNet achieves state-of-the-art performance on multiple existing benchmarks for RNA property prediction and protein modeling, and outperforms prior methods on two newly introduced 3D structural benchmarks.
arXiv Detail & Related papers (2025-06-10T07:43:50Z)
PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation [4.402280157389038]
We propose PharMolixFM, a unified framework for constructing all-atom foundation models. Our framework includes three variants using state-of-the-art multi-modal generative models. PharMolixFM-Diff achieves competitive prediction accuracy in protein-small-molecule docking.
arXiv Detail & Related papers (2025-03-12T12:53:43Z)
GraphXForm: Graph transformer for computer-aided molecular design with application to extraction [73.1842164721868]
We present GraphXForm, a decoder-only graph transformer architecture, which is pretrained on existing compounds and then fine-tuned. We evaluate it on two solvent design tasks for liquid-liquid extraction, showing that it outperforms four state-of-the-art molecular design techniques.
arXiv Detail & Related papers (2024-11-03T19:45:15Z)
UniIF: Unified Molecule Inverse Folding [67.60267592514381]
We propose a unified model UniIF for inverse folding of all molecules. Our proposed method surpasses state-of-the-art methods on all tasks.
arXiv Detail & Related papers (2024-05-29T10:26:16Z)
CryoChains: Heterogeneous Reconstruction of Molecular Assembly of Semi-flexible Chains from Cryo-EM Images [3.0828074702828623]
We propose CryoChains, which encodes large deformations of biomolecules via rigid body transformation of their chains. Our data experiments on the human GABA-B and heat shock protein show that CryoChains gives a biophysically-grounded quantification of the heterogeneous conformations of biomolecules.
arXiv Detail & Related papers (2023-06-12T17:57:12Z)
Towards Predicting Equilibrium Distributions for Molecular Systems with Deep Learning [60.02391969049972]
We introduce a novel deep learning framework, called Distributional Graphormer (DiG), in an attempt to predict the equilibrium distribution of molecular systems. DiG employs deep neural networks to transform a simple distribution towards the equilibrium distribution, conditioned on a descriptor of a molecular system.
arXiv Detail & Related papers (2023-06-08T17:12:08Z)
Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations [55.42602325017405]
We propose a novel method called GODE, which takes into account the two-level structure of individual molecules. By pre-training two graph neural networks (GNNs) on different graph structures, combined with contrastive learning, GODE fuses molecular structures with their corresponding knowledge graph substructures. When fine-tuned across 11 chemical property tasks, our model outperforms existing benchmarks, registering an average ROC-AUC uplift of 13.8% for classification tasks and an average RMSE/MAE enhancement of 35.1% for regression tasks.
arXiv Detail & Related papers (2023-06-02T15:49:45Z)
MUDiff: Unified Diffusion for Complete Molecule Generation [104.7021929437504]
We present a new model for generating a comprehensive representation of molecules, including atom features, 2D discrete molecule structures, and 3D continuous molecule coordinates. We propose a novel graph transformer architecture to denoise the diffusion process. Our model is a promising approach for designing stable and diverse molecules and can be applied to a wide range of tasks in molecular modeling.
arXiv Detail & Related papers (2023-04-28T04:25:57Z)
Heterogeneous reconstruction of deformable atomic models in Cryo-EM [30.864688165021054]
We describe a heterogeneous reconstruction method based on an atomistic representation whose deformation is reduced to a handful of collective motions. We show for each distribution that our approach is able to recapitulate the intermediate atomic models with atomic-level accuracy.
arXiv Detail & Related papers (2022-09-29T22:35:35Z)
Learning Geometrically Disentangled Representations of Protein Folding Simulations [72.03095377508856]
This work focuses on learning a generative neural network on a structural ensemble of a drug-target protein. Model tasks involve characterizing the distinct structural fluctuations of the protein bound to various drug molecules. Results show that our geometric learning-based method enjoys both accuracy and efficiency for generating complex structural variations.
arXiv Detail & Related papers (2022-05-20T19:38:00Z)
Accurate Machine Learned Quantum-Mechanical Force Fields for Biomolecular Simulations [51.68332623405432]
Molecular dynamics (MD) simulations allow atomistic insights into chemical and biological processes. Recently, machine learned force fields (MLFFs) emerged as an alternative means to execute MD simulations. This work proposes a general approach to constructing accurate MLFFs for large-scale molecular simulations.
arXiv Detail & Related papers (2022-05-17T13:08:28Z)
Transferring Chemical and Energetic Knowledge Between Molecular Systems with Machine Learning [5.27145343046974]
We propose a novel methodology for transferring knowledge obtained from simple molecular systems to a more complex one. We focus on the classification of high and low free-energy states. Our results show a remarkable AUC of 0.92 for transfer learning from tri-alanine to the deca-alanine system.
arXiv Detail & Related papers (2022-05-06T16:21:00Z)
Scalable Fragment-Based 3D Molecular Design with Reinforcement Learning [68.8204255655161]
We introduce a novel framework for scalable 3D design that uses a hierarchical agent to build molecules. In a variety of experiments, we show that our agent, guided only by energy considerations, can efficiently learn to produce molecules with over 100 atoms.
arXiv Detail & Related papers (2022-02-01T18:54:24Z)
A silicon qubit platform for in situ single molecule structure determination [0.7187911114620571]
Imaging individual conformational instances of generic, inhomogeneous, transient or intrinsically disordered protein systems at the single molecule level in situ is one of the notable challenges in structural biology. Here we tackle the problem by designing a single molecule imaging platform technology embracing the advantages of silicon-based spin qubits. We demonstrate through detailed simulation that this platform enables scalable atomic-level structure-determination of individual molecular systems in native environments.
arXiv Detail & Related papers (2021-12-07T10:42:09Z)
Message Passing Networks for Molecules with Tetrahedral Chirality [8.391459650489123]
We develop two custom aggregation functions for message passing neural networks to learn properties of molecules with tetrahedral chirality. Results show modest improvements over a baseline sum aggregator, highlighting opportunities for further architecture development.
arXiv Detail & Related papers (2020-11-24T03:03:09Z)
Self-Supervised Graph Transformer on Large-Scale Molecular Data [73.3448373618865]
We propose a novel framework, GROVER, for molecular representation learning. GROVER can learn rich structural and semantic information of molecules from enormous unlabelled molecular data. We pre-train GROVER with 100 million parameters on 10 million unlabelled molecules -- the biggest GNN and the largest training dataset in molecular representation learning.
arXiv Detail & Related papers (2020-06-18T08:37:04Z)
Hierarchical, rotation-equivariant neural networks to select structural models of protein complexes [6.092214762701847]
We introduce a machine learning method that learns directly from the 3D positions of all atoms to identify accurate models of protein complexes. Our network substantially improves the identification of accurate structural models among a large set of possible models.
arXiv Detail & Related papers (2020-06-05T20:17:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.