Chaining thoughts and LLMs to learn DNA structural biophysics
- URL: http://arxiv.org/abs/2403.01332v1
- Date: Sat, 2 Mar 2024 22:38:01 GMT
- Title: Chaining thoughts and LLMs to learn DNA structural biophysics
- Authors: Tyler D. Ross, Ashwin Gopinath
- Abstract summary: We show that a general-purpose large language model, ChatGPT 3.5-turbo, can be fine-tuned to learn the structural biophysics of DNA.
We find that both fine-tuning models to return chain-of-thought responses and chaining together models fine-tuned for subtasks enhance the ability to analyze and design DNA sequences and their structures.
- Score: 6.164223149261533
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The future development of an AI scientist, a tool that is capable of
integrating a variety of experimental data and generating testable hypotheses,
holds immense potential. So far, bespoke machine learning models have been
created to specialize in singular scientific tasks but otherwise lack the
flexibility of a general-purpose model. Here, we show that a general-purpose
large language model, ChatGPT 3.5-turbo, can be fine-tuned to learn the
structural biophysics of DNA. We find that both fine-tuning models to return
chain-of-thought responses and chaining together models fine-tuned for subtasks
enhance the ability to analyze and design DNA sequences and their structures.
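To make the chaining idea concrete, the sketch below composes two hypothetical fine-tuned chat models through the OpenAI chat completions API: one returns a chain-of-thought style answer for a narrow subtask (computing a reverse complement), and its output is passed to a second model that judges hybridization. The model identifiers and the choice of subtasks are illustrative assumptions, not the paper's released pipeline.

```python
# Minimal sketch of chaining subtask-fine-tuned models. Assumptions: the model
# IDs below are hypothetical placeholders, and the two subtasks shown are chosen
# for illustration rather than taken from the paper's released pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REVCOMP_MODEL = "ft:gpt-3.5-turbo:example-org:dna-revcomp:abc123"      # hypothetical
STRUCTURE_MODEL = "ft:gpt-3.5-turbo:example-org:dna-structure:def456"  # hypothetical


def ask(model: str, prompt: str) -> str:
    """Send a single-turn prompt to one fine-tuned subtask model."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()


def analyze(sequence: str) -> str:
    # Subtask 1: a model fine-tuned to show its working steps (chain of thought)
    # before stating the reverse complement of the input strand.
    revcomp = ask(REVCOMP_MODEL, f"Step by step, give the reverse complement of: {sequence}")

    # Subtask 2: the intermediate output is chained into a second fine-tuned
    # model that assesses whether the two strands would hybridize.
    return ask(
        STRUCTURE_MODEL,
        f"Strand A: {sequence}\nStrand B (from previous step): {revcomp}\n"
        "Will these strands hybridize?",
    )


if __name__ == "__main__":
    print(analyze("ATGCGTACCTGA"))
```

The design point mirrors the abstract: each fine-tuned model handles one narrow subtask, and intermediate outputs are passed forward as context, which is what gives the chained pipeline its enhanced ability to analyze and design sequences.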
Related papers
- UniGenX: Unified Generation of Sequence and Structure with Autoregressive Diffusion [61.690978792873196]
Existing approaches rely on either autoregressive sequence models or diffusion models.
We propose UniGenX, a unified framework that combines autoregressive next-token prediction with conditional diffusion models.
We validate the effectiveness of UniGenX on material and small molecule generation tasks.
arXiv Detail & Related papers (2025-03-09T16:43:07Z) - Nature Language Model: Deciphering the Language of Nature for Scientific Discovery [105.55751854768297]
Foundation models have revolutionized natural language processing and artificial intelligence.
We introduce Nature Language Model (NatureLM), a sequence-based science foundation model for scientific discovery.
arXiv Detail & Related papers (2025-02-11T13:08:03Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.
It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models [51.316001071698224]
We introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences.
This dataset bridges the gap between large language models (LLMs) and complex tasks involving biological sequences.
We also develop a strong baseline called ChatMultiOmics with a novel three-stage training pipeline.
arXiv Detail & Related papers (2024-12-26T12:12:23Z) - Long Term Memory: The Foundation of AI Self-Evolution [48.52678410533424]
Large language models (LLMs) like GPTs, trained on vast datasets, have demonstrated impressive capabilities in language understanding, reasoning, and planning.
Most studies focus on enhancing these models by training on ever-larger datasets to build more powerful foundation models.
Beyond large-scale training, enabling models to evolve during inference is equally crucial; we refer to this process as AI self-evolution.
arXiv Detail & Related papers (2024-10-21T06:09:30Z) - Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions [4.36852565205713]
We present our work training the largest open-source multi-omic foundation model to date.
We show that these multi-omic models (MOMs) can learn joint representations across various single-omic distributions.
We also demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on protein-nucleic acid interaction tasks.
arXiv Detail & Related papers (2024-08-29T03:56:40Z) - Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z) - Cognitive Evolutionary Learning to Select Feature Interactions for Recommender Systems [59.117526206317116]
We show that CELL can adaptively evolve into different models for different tasks and data.
Experiments on four real-world datasets demonstrate that CELL significantly outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-29T02:35:23Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - Empowering Biomedical Discovery with AI Agents [15.125735219811268]
We envision "AI scientists" as systems capable of skeptical learning and reasoning.
Biomedical AI agents combine human creativity and expertise with AI's ability to analyze large datasets.
AI agents can impact areas ranging from virtual cell simulation, programmable control of phenotypes, and the design of cellular circuits to the development of new therapies.
arXiv Detail & Related papers (2024-04-03T16:08:01Z) - xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z) - Constructing Effective Machine Learning Models for the Sciences: A Multidisciplinary Perspective [77.53142165205281]
We show that flexible non-linear solutions do not always improve upon linear regression models with manually added transforms and interactions between variables.
We discuss how to recognize this before constructing a data-driven model and how such analysis can help us move to intrinsically interpretable regression models.
arXiv Detail & Related papers (2022-11-21T17:48:44Z) - Modeling Protein Using Large-scale Pretrain Language Model [12.568452480689578]
Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets.
Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences.
Our model can accurately capture evolution information from pretraining on evolutionary-scale individual sequences.
arXiv Detail & Related papers (2021-08-17T04:13:11Z) - Molecular modeling with machine-learned universal potential functions [15.138489177130511]
We show that neural networks can be used to train a universal approximator for energy potential functions.
We have been able to train smooth, differentiable, predictive potential functions on large-scale crystal structures.
arXiv Detail & Related papers (2021-03-06T17:36:39Z) - Physics-Integrated Variational Autoencoders for Robust and Interpretable Generative Modeling [86.9726984929758]
We focus on the integration of incomplete physics models into deep generative models.
We propose a VAE architecture in which a part of the latent space is grounded by physics.
We demonstrate generative performance improvements over a set of synthetic and real-world datasets.
arXiv Detail & Related papers (2021-02-25T20:28:52Z) - Hierarchical, rotation-equivariant neural networks to select structural models of protein complexes [6.092214762701847]
We introduce a machine learning method that learns directly from the 3D positions of all atoms to identify accurate models of protein complexes.
Our network substantially improves the identification of accurate structural models among a large set of possible models.
arXiv Detail & Related papers (2020-06-05T20:17:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.