Language Models for Controllable DNA Sequence Design
- URL: http://arxiv.org/abs/2507.19523v1
- Date: Sat, 19 Jul 2025 06:23:17 GMT
- Title: Language Models for Controllable DNA Sequence Design
- Authors: Xingyu Su, Xiner Li, Yuchao Lin, Ziqian Xie, Degui Zhi, Shuiwang Ji
- Abstract summary: We introduce ATGC-Gen, an Automated Transformer Generator for Controllable Generation. ATGC-Gen is instantiated with both decoder-only and encoder-only transformer architectures. Our experiments demonstrate that ATGC-Gen can generate fluent, diverse, and biologically relevant sequences.
- Score: 41.74647005781059
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider controllable DNA sequence design, where sequences are generated by conditioning on specific biological properties. While language models (LMs) such as GPT and BERT have achieved remarkable success in natural language generation, their application to DNA sequence generation remains largely underexplored. In this work, we introduce ATGC-Gen, an Automated Transformer Generator for Controllable Generation, which leverages cross-modal encoding to integrate diverse biological signals. ATGC-Gen is instantiated with both decoder-only and encoder-only transformer architectures, allowing flexible training and generation under either autoregressive or masked recovery objectives. We evaluate ATGC-Gen on representative tasks including promoter and enhancer sequence design, and further introduce a new dataset based on ChIP-Seq experiments for modeling protein binding specificity. Our experiments demonstrate that ATGC-Gen can generate fluent, diverse, and biologically relevant sequences aligned with the desired properties. Compared to prior methods, our model achieves notable improvements in controllability and functional relevance, highlighting the potential of language models in advancing programmable genomic design. The source code is released at (https://github.com/divelab/AIRS/blob/main/OpenBio/ATGC_Gen).
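To make the conditioning mechanism concrete, below is a minimal sketch of property-conditioned, decoder-only generation in the spirit of the setup the abstract describes: a control token encoding the desired biological property is prepended as a prefix, and nucleotides are sampled autoregressively. This is an illustration only, not the released ATGC-Gen implementation; the vocabulary, condition tokens, model sizes, and function names are all hypothetical.

```python
# Hypothetical sketch, not the authors' code: property-conditioned DNA generation
# with a decoder-only transformer (condition-token prefix + autoregressive sampling).
import torch
import torch.nn as nn

VOCAB = {"<pad>": 0, "<bos>": 1, "A": 2, "C": 3, "G": 4, "T": 5,
         "<cond:high_activity>": 6, "<cond:low_activity>": 7}
IDX2TOK = {i: t for t, i in VOCAB.items()}

class TinyDecoderLM(nn.Module):
    """Decoder-only transformer: causal self-attention over prefix + sequence."""
    def __init__(self, vocab_size=len(VOCAB), d_model=64, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(512, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):                        # tokens: (batch, length)
        length = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(length))
        causal = nn.Transformer.generate_square_subsequent_mask(length)
        return self.lm_head(self.blocks(x, mask=causal))  # (batch, length, vocab)

@torch.no_grad()
def generate(model, condition_token, length=30, temperature=1.0):
    """Sample a DNA sequence conditioned on a control-token prefix."""
    seq = torch.tensor([[VOCAB[condition_token], VOCAB["<bos>"]]])
    for _ in range(length):
        logits = model(seq)[:, -1] / temperature
        nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        seq = torch.cat([seq, nxt], dim=1)
    return "".join(IDX2TOK[i] for i in seq[0, 2:].tolist() if IDX2TOK[i] in "ACGT")

model = TinyDecoderLM()                               # untrained; illustration only
print(generate(model, "<cond:high_activity>"))
```

Under the encoder-only variant mentioned in the abstract, the same condition tokens would stay visible while nucleotide positions are masked and recovered (a masked-recovery objective) instead of being decoded left to right.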
Related papers
- Regulatory DNA sequence Design with Reinforcement Learning [56.20290878358356]
We propose a generative approach that leverages reinforcement learning to fine-tune a pre-trained autoregressive model. We evaluate our method on promoter design tasks in two yeast media conditions and enhancer design tasks for three human cell types.
arXiv Detail & Related papers (2025-03-11T02:33:33Z)
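As a rough illustration of the idea summarized in that entry (fine-tuning a pre-trained autoregressive DNA model against a property reward with reinforcement learning), the following is a generic REINFORCE-style loop. It is not the paper's algorithm: the stand-in model, the toy reward (GC content in place of a learned activity predictor), and all hyper-parameters are hypothetical.

```python
# Generic REINFORCE sketch (not the paper's method): sample sequences from an
# autoregressive DNA model, score them, and push up the log-probability of
# high-reward samples.
import torch
import torch.nn as nn
from torch.distributions import Categorical

class CharLM(nn.Module):
    """Stand-in for a pre-trained autoregressive DNA language model."""
    def __init__(self, d=32):
        super().__init__()
        self.embed = nn.Embedding(4, d)               # A/C/G/T
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, 4)

    def forward(self, x):                             # x: (batch, length) indices
        h, _ = self.rnn(self.embed(x))
        return self.head(h)                           # next-token logits

def reward_model(seqs):
    """Hypothetical reward: GC fraction as a stand-in for predicted activity."""
    return ((seqs == 1) | (seqs == 2)).float().mean(dim=1)

def rl_finetune_step(policy, optimizer, batch=16, length=50):
    seqs = torch.randint(0, 4, (batch, 1))            # random first nucleotide
    log_probs = []
    for _ in range(length - 1):                       # roll out the policy
        dist = Categorical(logits=policy(seqs)[:, -1])
        nxt = dist.sample()
        log_probs.append(dist.log_prob(nxt))
        seqs = torch.cat([seqs, nxt.unsqueeze(1)], dim=1)
    reward = reward_model(seqs)
    advantage = reward - reward.mean()                # simple baseline
    loss = -(advantage.detach() * torch.stack(log_probs, 1).sum(1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()

policy = CharLM()                                     # imagine this is pre-trained
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for step in range(5):                                 # a few illustrative updates
    print(f"step {step}: mean reward {rl_finetune_step(policy, opt):.3f}")
```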
- Can Large Language Models Predict Antimicrobial Resistance Gene? [0.0]
This study evaluates how effectively generative Large Language Models handle DNA sequences with various labels. Experiments were conducted on antimicrobial resistance genes, and the results show that generative Large Language Models can offer comparable or potentially better predictions.
arXiv Detail & Related papers (2025-03-06T13:10:57Z)
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA [2.543784712990392]
Large genomic DNA language models (DNALMs) aim to learn generalizable representations of diverse DNA elements. Our benchmarks target biologically meaningful downstream tasks such as functional sequence feature discovery, predicting cell-type specific regulatory activity, and counterfactual prediction of the impacts of genetic variants.
arXiv Detail & Related papers (2024-12-06T21:23:35Z)
- Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models [59.60208063956459]
Large Language Models (LLMs) require high quality instruction data for effective alignment. We present Genetic-Instruct, a scalable algorithm for synthesizing large-scale, high quality coding instructions.
arXiv Detail & Related papers (2024-07-29T20:42:59Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
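A toy sketch of the vector-quantization idea the VQDNA summary alludes to: fragment embeddings are snapped to their nearest entry in a learnable codebook, and the resulting code indices serve as a pattern-aware vocabulary. This is a generic VQ tokenizer for illustration, not VQDNA's actual model; the k-mer size, codebook size, and dimensions are hypothetical.

```python
# Generic vector-quantized tokenizer sketch: encode k-mers, assign each to its
# nearest codebook entry, and use the code indices as learned "vocabulary" tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQTokenizer(nn.Module):
    def __init__(self, codebook_size=64, d=16, kmer=4):
        super().__init__()
        self.kmer = kmer
        self.encoder = nn.Sequential(nn.Linear(4 * kmer, d), nn.ReLU(),
                                     nn.Linear(d, d))
        self.codebook = nn.Embedding(codebook_size, d)    # learnable vocabulary

    def forward(self, seq):
        """seq: string of A/C/G/T. Returns (code indices, quantized embeddings)."""
        # One-hot encode non-overlapping k-mers.
        idx = torch.tensor([["ACGT".index(c) for c in seq[i:i + self.kmer]]
                            for i in range(0, len(seq) - self.kmer + 1, self.kmer)])
        onehot = F.one_hot(idx, 4).float().flatten(1)      # (n_kmers, 4*kmer)
        z = self.encoder(onehot)                           # (n_kmers, d)
        # Nearest-codebook assignment (Euclidean distance).
        dists = torch.cdist(z, self.codebook.weight)       # (n_kmers, K)
        codes = dists.argmin(dim=1)                        # pattern-aware tokens
        quantized = self.codebook(codes)
        # Straight-through estimator so gradients can flow back to the encoder.
        quantized = z + (quantized - z).detach()
        return codes, quantized

tok = VQTokenizer()
codes, emb = tok("ACGTACGTTTGGCCAA")
print(codes.tolist(), emb.shape)        # e.g. 4 code indices, (4, 16) embeddings
```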
- Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision [26.107996342704915]
This paper presents the Ensemble Nucleotide Byte-level-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture.
We use Masked Language Modeling to pre-train the foundation model using reference genome sequences and apply it in the following downstream tasks.
In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.
arXiv Detail & Related papers (2023-11-04T06:00:56Z)
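For context on the masked-language-modeling pre-training mentioned in the ENBED summary, here is the standard BERT-style corruption step applied to nucleotide tokens: a fraction of positions is selected for prediction, most are replaced by a mask token, some by random bases, and the rest left unchanged. The 15% rate and 80/10/10 split are the usual defaults, shown for illustration only; ENBED's byte-level scheme may differ.

```python
# Standard MLM corruption recipe applied to a DNA string (illustrative defaults).
import torch

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "[MASK]": 4}

def mlm_corrupt(seq, mask_prob=0.15):
    ids = torch.tensor([VOCAB[c] for c in seq])
    labels = ids.clone()
    selected = torch.rand(len(ids)) < mask_prob           # positions to predict
    labels[~selected] = -100                               # ignored by cross-entropy
    r = torch.rand(len(ids))
    ids[selected & (r < 0.8)] = VOCAB["[MASK]"]            # 80%: mask token
    random_pos = selected & (r >= 0.8) & (r < 0.9)         # 10%: random nucleotide
    ids[random_pos] = torch.randint(0, 4, (int(random_pos.sum()),))
    return ids, labels                                     # remaining 10%: unchanged

inputs, labels = mlm_corrupt("ACGTACGTAGGCTTACGGAT")
print(inputs.tolist())
print(labels.tolist())
```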
- Epigenomic language models powered by Cerebras [0.0]
Epigenomic BERT (or EBERT) learns representations based on both DNA sequence and paired epigenetic state inputs.
We show EBERT's transfer learning potential by demonstrating strong performance on a cell type-specific transcription factor binding prediction task.
Our fine-tuned model exceeds state-of-the-art performance on 4 of 13 evaluation datasets from the ENCODE-DREAM benchmarks and earns an overall rank of 3rd on the challenge leaderboard.
arXiv Detail & Related papers (2021-12-14T17:23:42Z)
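The EBERT summary above describes learning from DNA sequence paired with epigenetic state inputs. One plausible, but hypothetical, way to realize such paired inputs is to embed the two per-position tracks separately and sum them before the encoder, analogous to BERT's token-plus-segment embeddings; the sketch below shows only that input construction, not EBERT's published architecture, and the state vocabulary and dimensions are invented.

```python
# Hypothetical paired-input embedding: per-base DNA tokens plus a per-base
# epigenetic state track, embedded separately and summed (BERT-style).
import torch
import torch.nn as nn

class PairedInputEmbedding(nn.Module):
    def __init__(self, n_states=3, d=64, max_len=512):
        super().__init__()
        self.dna = nn.Embedding(4, d)           # A/C/G/T
        self.state = nn.Embedding(n_states, d)  # e.g. closed / open / unknown chromatin
        self.pos = nn.Embedding(max_len, d)

    def forward(self, dna_ids, state_ids):       # both (batch, length)
        positions = torch.arange(dna_ids.size(1))
        return self.dna(dna_ids) + self.state(state_ids) + self.pos(positions)

emb = PairedInputEmbedding()
dna = torch.randint(0, 4, (1, 10))                # toy 10-bp window
state = torch.randint(0, 3, (1, 10))              # toy per-base chromatin state
x = emb(dna, state)                               # (1, 10, 64), input to a BERT encoder
print(x.shape)
```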
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.