Chaining thoughts and LLMs to learn DNA structural biophysics
- URL: http://arxiv.org/abs/2403.01332v1
- Date: Sat, 2 Mar 2024 22:38:01 GMT
- Title: Chaining thoughts and LLMs to learn DNA structural biophysics
- Authors: Tyler D. Ross, Ashwin Gopinath
- Abstract summary: We show that a general-purpose large language model, ChatGPT 3.5-turbo, can be fine-tuned to learn the structural biophysics of DNA.
We find that both fine-tuning models to return chain-of-thought responses and chaining together models fine-tuned for subtasks enhance the ability to analyze and design DNA sequences and their structures.
- Score: 6.164223149261533
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The future development of an AI scientist, a tool that is capable of
integrating a variety of experimental data and generating testable hypotheses,
holds immense potential. So far, bespoke machine learning models have been
created to specialize in singular scientific tasks but otherwise lack the
flexibility of a general-purpose model. Here, we show that a general-purpose
large language model, ChatGPT 3.5-turbo, can be fine-tuned to learn the
structural biophysics of DNA. We find that both fine-tuning models to return
chain-of-thought responses and chaining together models fine-tuned for subtasks
enhance the ability to analyze and design DNA sequences and their structures.
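To make the chaining idea concrete, the sketch below composes two hypothetical fine-tuned chat models through the OpenAI chat completions API: one returns a chain-of-thought style answer for a narrow subtask (computing a reverse complement), and its output is passed to a second model that judges hybridization. The model identifiers and the choice of subtasks are illustrative assumptions, not the paper's released pipeline.

```python
# Minimal sketch of chaining subtask-fine-tuned models. Assumptions: the model
# IDs below are hypothetical placeholders, and the two subtasks shown are chosen
# for illustration rather than taken from the paper's released pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REVCOMP_MODEL = "ft:gpt-3.5-turbo:example-org:dna-revcomp:abc123"      # hypothetical
STRUCTURE_MODEL = "ft:gpt-3.5-turbo:example-org:dna-structure:def456"  # hypothetical


def ask(model: str, prompt: str) -> str:
    """Send a single-turn prompt to one fine-tuned subtask model."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()


def analyze(sequence: str) -> str:
    # Subtask 1: a model fine-tuned to show its working steps (chain of thought)
    # before stating the reverse complement of the input strand.
    revcomp = ask(REVCOMP_MODEL, f"Step by step, give the reverse complement of: {sequence}")

    # Subtask 2: the intermediate output is chained into a second fine-tuned
    # model that assesses whether the two strands would hybridize.
    return ask(
        STRUCTURE_MODEL,
        f"Strand A: {sequence}\nStrand B (from previous step): {revcomp}\n"
        "Will these strands hybridize?",
    )


if __name__ == "__main__":
    print(analyze("ATGCGTACCTGA"))
```

The design point mirrors the abstract: each fine-tuned model handles one narrow subtask, and intermediate outputs are passed forward as context, which is what gives the chained pipeline its enhanced ability to analyze and design sequences.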
Related papers
- UniGenX: Unified Generation of Sequence and Structure with Autoregressive Diffusion [61.690978792873196]
Existing approaches rely on either autoregressive sequence models or diffusion models.
We propose UniGenX, a unified framework that combines autoregressive next-token prediction with conditional diffusion models.
We validate the effectiveness of UniGenX on material and small molecule generation tasks.
arXiv Detail & Related papers (2025-03-09T16:43:07Z) - Nature Language Model: Deciphering the Language of Nature for Scientific Discovery [105.55751854768297]
Foundation models have revolutionized natural language processing and artificial intelligence.
We introduce Nature Language Model (NatureLM), a sequence-based science foundation model for scientific discovery.
arXiv Detail & Related papers (2025-02-11T13:08:03Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.
It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models [51.316001071698224]
We introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences.
This dataset bridges the gap between large language models (LLMs) and complex tasks involving biological sequences.
We also develop a strong baseline called ChatMultiOmics with a novel three-stage training pipeline.
arXiv Detail & Related papers (2024-12-26T12:12:23Z) - Long Term Memory: The Foundation of AI Self-Evolution [48.52678410533424]
Large language models (LLMs) like GPTs, trained on vast datasets, have demonstrated impressive capabilities in language understanding, reasoning, and planning.
Most studies focus on enhancing these models by training on ever-larger datasets to build more powerful foundation models.
Beyond large-scale training, enabling models to evolve during inference is equally crucial; we refer to this process as AI self-evolution.
arXiv Detail & Related papers (2024-10-21T06:09:30Z) - Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions [4.36852565205713]
We present our work training the largest open-source multi-omic foundation model to date.
We show that these multi-omic models (MOMs) can learn joint representations across various single-omic distributions.
We also demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on protein-nucleic acid interaction tasks.
arXiv Detail & Related papers (2024-08-29T03:56:40Z) - Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z) - Cognitive Evolutionary Learning to Select Feature Interactions for Recommender Systems [59.117526206317116]
We show that CELL can adaptively evolve into different models for different tasks and data.
Experiments on four real-world datasets demonstrate that CELL significantly outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-29T02:35:23Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - Empowering Biomedical Discovery with AI Agents [15.125735219811268]
We envision "AI scientists" as systems capable of skeptical learning and reasoning.
Biomedical AI agents combine human creativity and expertise with AI's ability to analyze large datasets.
AI agents can impact areas ranging from virtual cell simulation, programmable control of phenotypes, and the design of cellular circuits to the development of new therapies.
arXiv Detail & Related papers (2024-04-03T16:08:01Z) - xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z) - Constructing Effective Machine Learning Models for the Sciences: A Multidisciplinary Perspective [77.53142165205281]
We show that flexible non-linear solutions do not always improve upon linear regression models with manually added transforms and interactions between variables.
We discuss how to recognize this before constructing a data-driven model and how such analysis can help us move to intrinsically interpretable regression models.
arXiv Detail & Related papers (2022-11-21T17:48:44Z) - Modeling Protein Using Large-scale Pretrain Language Model [12.568452480689578]
Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets.
Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences.
Our model can accurately capture evolution information from pretraining on evolutionary-scale individual sequences.
arXiv Detail & Related papers (2021-08-17T04:13:11Z) - Molecular modeling with machine-learned universal potential functions [15.138489177130511]
We show that neural networks can be used to train a universal approximator for energy potential functions.
We have been able to train smooth, differentiable, predictive potential functions on large-scale crystal structures.
arXiv Detail & Related papers (2021-03-06T17:36:39Z) - Physics-Integrated Variational Autoencoders for Robust and Interpretable Generative Modeling [86.9726984929758]
We focus on the integration of incomplete physics models into deep generative models.
We propose a VAE architecture in which a part of the latent space is grounded by physics.
We demonstrate generative performance improvements over a set of synthetic and real-world datasets.
arXiv Detail & Related papers (2021-02-25T20:28:52Z) - Hierarchical, rotation-equivariant neural networks to select structural models of protein complexes [6.092214762701847]
We introduce a machine learning method that learns directly from the 3D positions of all atoms to identify accurate models of protein complexes.
Our network substantially improves the identification of accurate structural models among a large set of possible models.
arXiv Detail & Related papers (2020-06-05T20:17:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.