Chaining thoughts and LLMs to learn DNA structural biophysics
- URL: http://arxiv.org/abs/2403.01332v1
- Date: Sat, 2 Mar 2024 22:38:01 GMT
- Title: Chaining thoughts and LLMs to learn DNA structural biophysics
- Authors: Tyler D. Ross, Ashwin Gopinath
- Abstract summary: We show that a general-purpose large language model, ChatGPT 3.5-turbo, can be fine-tuned to learn the structural biophysics of DNA.
We find that models fine-tuned to return chain-of-thought responses, and chains of models each fine-tuned for a subtask, show an enhanced ability to analyze and design DNA sequences and their structures.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The future development of an AI scientist, a tool that is capable of
integrating a variety of experimental data and generating testable hypotheses,
holds immense potential. So far, bespoke machine learning models have been
created to specialize in singular scientific tasks, but otherwise lack the
flexibility of a general-purpose model. Here, we show that a general-purpose
large language model, ChatGPT 3.5-turbo, can be fine-tuned to learn the
structural biophysics of DNA. We find that models fine-tuned to return
chain-of-thought responses, and chains of models each fine-tuned for a
subtask, show an enhanced ability to analyze and design DNA sequences and
their structures.
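
The chaining idea maps naturally onto a pipeline of API calls, where each fine-tuned model handles one subtask and writes its intermediate result out as text for the next model to condition on. Below is a minimal sketch of such a pipeline using the OpenAI Python client; the model names, prompts, and subtask split (reverse complement, then a hybridization check) are illustrative placeholders, not the models or decomposition used in the paper.

```python
# Minimal sketch of chaining two fine-tuned chat models for DNA subtasks.
# Model names and prompts are hypothetical placeholders, not those from the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REV_COMP_MODEL = "ft:gpt-3.5-turbo:org:rev-comp:xxxx"     # hypothetical fine-tune
STRUCTURE_MODEL = "ft:gpt-3.5-turbo:org:sec-struct:xxxx"  # hypothetical fine-tune


def ask(model: str, prompt: str) -> str:
    """Single-turn call to a fine-tuned chat model."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()


def analyze_pair(strand_a: str, strand_b: str) -> str:
    # Subtask 1: have the first model write out the reverse complement,
    # making the intermediate step explicit (chain-of-thought style).
    rev_comp = ask(REV_COMP_MODEL, f"Reverse complement of {strand_a}:")

    # Subtask 2: pass that intermediate text to the second model,
    # which judges whether the two strands are expected to hybridize.
    return ask(
        STRUCTURE_MODEL,
        f"Strand A: {strand_a}\nReverse complement of A: {rev_comp}\n"
        f"Strand B: {strand_b}\nDo A and B hybridize?",
    )


print(analyze_pair("ATGCGT", "ACGCAT"))
```

Keeping the intermediate reverse complement as explicit text mirrors the chain-of-thought fine-tuning: errors in the intermediate step stay visible and can be checked before the final answer is produced.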
Related papers
- HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model (arXiv, 2025-02-15)
  We propose HybriDNA, a decoder-only DNA language model that incorporates a hybrid Transformer-Mamba2 architecture.
  This hybrid design enables HybriDNA to efficiently process DNA sequences up to 131 kb in length with single-nucleotide resolution.
  HybriDNA achieves state-of-the-art performance across 33 DNA understanding datasets curated from the BEND, GUE, and LRB benchmarks.
- NatureLM: Deciphering the Language of Nature for Scientific Discovery (arXiv, 2025-02-11)
  Foundation models have revolutionized natural language processing and artificial intelligence.
  We introduce Nature Language Model (NatureLM for short), a sequence-based science foundation model for scientific discovery.
- GENERator: A Long-Context Generative Genomic Foundation Model (arXiv, 2025-02-11)
  We present a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
  The model adheres to the central dogma of molecular biology, accurately generating protein-coding sequences.
  It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of promoter sequences.
- Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models (arXiv, 2024-12-26)
  We introduce Biology-Instructions, the first large-scale multi-omics instruction-tuning dataset for biological sequences.
  This dataset can bridge the gap between large language models (LLMs) and complex tasks involving biological sequences.
  We also develop a strong baseline called ChatMultiOmics with a novel three-stage training pipeline.
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics (arXiv, 2024-07-03)
  Black-box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
  We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
- Cognitive Evolutionary Learning to Select Feature Interactions for Recommender Systems (arXiv, 2024-05-29)
  We show that CELL can adaptively evolve into different models for different tasks and data.
  Experiments on four real-world datasets demonstrate that CELL significantly outperforms state-of-the-art baselines.
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling (arXiv, 2024-05-13)
  VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
  By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
- Empowering Biomedical Discovery with AI Agents (arXiv, 2024-04-03)
  We envision "AI scientists" as systems capable of skeptical learning and reasoning.
  Biomedical AI agents combine human creativity and expertise with AI's ability to analyze large datasets.
  AI agents can impact areas ranging from virtual cell simulation, programmable control of phenotypes, and the design of cellular circuits to the development of new therapies.
- Molecular modeling with machine-learned universal potential functions (arXiv, 2021-03-06)
  We show that neural networks can be used to train a universal approximator for energy potential functions.
  We have been able to train smooth, differentiable, predictive potential functions on large-scale crystal structures.
This list is automatically generated from the titles and abstracts of the papers on this site.