Genome-Factory: An Integrated Library for Tuning, Deploying, and Interpreting Genomic Models
- URL: http://arxiv.org/abs/2509.12266v1
- Date: Sat, 13 Sep 2025 03:31:55 GMT
- Title: Genome-Factory: An Integrated Library for Tuning, Deploying, and Interpreting Genomic Models
- Authors: Weimin Wu, Xuefeng Song, Yibo Wen, Qinjie Lin, Zhihan Zhou, Jerry Yao-Chieh Hu, Zhong Wang, Han Liu,
- Abstract summary: Genome-Factory is an integrated Python library for tuning, deploying, and interpreting genomic models.<n>For data collection, Genome-Factory offers an automated pipeline to download genomic sequences and preprocess them.<n>For inference, Genome-Factory enables both embedding extraction and DNA sequence generation.<n>For interpretability, Genome-Factory introduces the first open-source biological interpreter based on a sparse auto-encoder.
- Score: 15.523936567029624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Genome-Factory, an integrated Python library for tuning, deploying, and interpreting genomic models. Our core contribution is to simplify and unify the workflow for genomic model development: data collection, model tuning, inference, benchmarking, and interpretability. For data collection, Genome-Factory offers an automated pipeline to download genomic sequences and preprocess them. It also includes quality control, such as GC content normalization. For model tuning, Genome-Factory supports three approaches: full-parameter, low-rank adaptation, and adapter-based fine-tuning. It is compatible with a wide range of genomic models. For inference, Genome-Factory enables both embedding extraction and DNA sequence generation. For benchmarking, we include two existing benchmarks and provide a flexible interface for users to incorporate additional benchmarks. For interpretability, Genome-Factory introduces the first open-source biological interpreter based on a sparse auto-encoder. This module disentangles embeddings into sparse, near-monosemantic latent units and links them to interpretable genomic features by regressing on external readouts. To improve accessibility, Genome-Factory features both a zero-code command-line interface and a user-friendly web interface. We validate the utility of Genome-Factory across three dimensions: (i) Compatibility with diverse models and fine-tuning methods; (ii) Benchmarking downstream performance using two open-source benchmarks; (iii) Biological interpretation of learned representations with DNABERT-2. These results highlight its end-to-end usability and practical value for real-world genomic analysis.
Related papers
- PlantBiMoE: A Bidirectional Foundation Model with SparseMoE for Plant Genomes [9.805758991551043]
PlantBiMoE is a lightweight and expressive plant genome language model.<n>It integrates a bidirectional Mamba and a Sparse Mixture-of-Experts framework.
arXiv Detail & Related papers (2025-12-08T02:51:46Z) - PanFoMa: A Lightweight Foundation Model and Benchmark for Pan-Cancer [54.958921946378304]
We introduce PanFoMa, a lightweight hybrid neural network that combines the strengths of Transformers and state-space models.<n>PanFoMa consists of a front-end local-context encoder with shared self-attention layers to capture complex, order-independent gene interactions.<n>We also construct a large-scale pan-cancer single-cell benchmark, PanFoMaBench, containing over 3.5 million high-quality cells.
arXiv Detail & Related papers (2025-12-02T08:31:31Z) - GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AI [52.13138825802668]
GeoFMs are transforming Earth Observation, but evaluation lacks standardized protocols.<n> GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation.<n>Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.
arXiv Detail & Related papers (2025-11-19T17:45:02Z) - Beyond GeneGPT: A Multi-Agent Architecture with Open-Source LLMs for Enhanced Genomic Question Answering [29.961363790887003]
We reproduce GeneGPT in a pilot study using open source models, including Llama 3.1, Qwen2.5, and Qwen2.5 Coder, within a monolithic architecture.<n>We then develop OpenBioLLM, a modular multi-agent framework that extends GeneGPT by introducing agent specialization for tool routing, query generation, and response validation.<n>OpenBioLLM matches or outperforms GeneGPT on over 90% of the benchmark tasks, achieving average scores of 0.849 on Gene-Turing and 0.830 on GeneHop.
arXiv Detail & Related papers (2025-11-19T03:08:20Z) - Retrieval-augmented reasoning with lean language models [5.615564811138556]
We develop a retrieval augmented conversational agent capable of interpreting complex, domain-specific queries.<n>Our system integrates a dense retriever with fine-tuned Qwen2.5-Instruct models.<n>All implementation details and code are publicly released to support and adaptation across domains.
arXiv Detail & Related papers (2025-08-15T10:38:15Z) - OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking [21.177773831820673]
Genomic Foundation Models (GFMs) have emerged as a transformative approach to decoding the genome.<n>As GFMs scale up and reshape the landscape of AI-driven genomics, the field faces an urgent need for rigorous and reproducible evaluation.<n>We present OmniGenBench, a modular benchmarking platform designed to unify the data, model, benchmarking, and interpretability layers across GFMs.
arXiv Detail & Related papers (2025-05-20T14:16:25Z) - GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - Efficient and Scalable Fine-Tune of Language Models for Genome
Understanding [49.606093223945734]
We present textscLingo: textscLanguage prefix ftextscIne-tuning for textscGentextscOmes.
Unlike DNA foundation models, textscLingo strategically leverages natural language foundation models' contextual cues.
textscLingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z) - Genomic Interpreter: A Hierarchical Genomic Deep Neural Network with 1D
Shifted Window Transformer [4.059849656394191]
Genomic Interpreter is a novel architecture for genomic assay prediction.
Model can identify hierarchical dependencies in genomic sites.
Evaluated on a dataset containing 38,171 DNA segments of 17K pairs.
arXiv Detail & Related papers (2023-06-08T12:10:13Z) - TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z) - Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z) - Mutual Exclusivity Training and Primitive Augmentation to Induce
Compositionality [84.94877848357896]
Recent datasets expose the lack of the systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.