Incorporating LLM Embeddings for Variation Across the Human Genome
- URL: http://arxiv.org/abs/2509.20702v1
- Date: Thu, 25 Sep 2025 03:09:16 GMT
- Title: Incorporating LLM Embeddings for Variation Across the Human Genome
- Authors: Hongqian Niu, Jordan Bryan, Xihao Li, Didong Li
- Abstract summary: We present one of the first systematic frameworks to generate variant-level embeddings across the entire human genome. Using curated annotations from FAVOR, ClinVar, and the GWAS Catalog, we constructed semantic text descriptions for 8.9 billion possible variants. Embeddings were produced with both OpenAI's text-embedding-3-large and the open-source Qwen3-Embedding-0.6B models.
- Score: 7.919252190254812
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large language model (LLM) embeddings have enabled powerful representations for biological data, but most applications to date focus only on gene-level information. We present one of the first systematic frameworks to generate variant-level embeddings across the entire human genome. Using curated annotations from FAVOR, ClinVar, and the GWAS Catalog, we constructed semantic text descriptions for 8.9 billion possible variants and generated embeddings at three scales: 1.5 million HapMap3+MEGA variants, ~90 million imputed UK Biobank variants, and ~9 billion all possible variants. Embeddings were produced with both OpenAI's text-embedding-3-large and the open-source Qwen3-Embedding-0.6B models. Baseline experiments demonstrate high predictive accuracy for variant properties, validating the embeddings as structured representations of genomic variation. We outline two downstream applications: embedding-informed hypothesis testing by extending the Frequentist And Bayesian framework to genome-wide association studies, and embedding-augmented genetic risk prediction that enhances standard polygenic risk scores. These resources, publicly available on Hugging Face, provide a foundation for advancing large-scale genomic discovery and precision medicine.
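The abstract describes turning curated annotations into a text description for each variant and then embedding that text. A minimal sketch of that flow follows, under loud assumptions: the field names (`gene`, `consequence`, `clinical_significance`, `traits`), the sentence template, and the example values are hypothetical and not taken from the paper, and `toy_embed` is a feature-hashing placeholder standing in for a real embedding model such as text-embedding-3-large or Qwen3-Embedding-0.6B.

```python
import hashlib
import math


def describe_variant(v: dict) -> str:
    """Render annotation fields as one description; absent fields are skipped.

    The field names and wording are hypothetical, not the paper's template.
    """
    parts = [f"Variant {v['chrom']}:{v['pos']} {v['ref']}>{v['alt']}"]
    if v.get("gene"):
        parts.append(f"located in gene {v['gene']}")
    if v.get("consequence"):
        parts.append(f"with predicted consequence {v['consequence']}")
    if v.get("clinical_significance"):
        parts.append(f"ClinVar significance {v['clinical_significance']}")
    if v.get("traits"):
        parts.append("GWAS Catalog traits " + ", ".join(v["traits"]))
    return "; ".join(parts) + "."


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Placeholder embedding: hash each token into a bucket, then L2-normalize.

    A real pipeline would call an embedding model here instead.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    # Made-up variant used only to exercise the functions.
    variant = {"chrom": "1", "pos": 12345, "ref": "A", "alt": "G",
               "gene": "GENE1", "consequence": "missense"}
    text = describe_variant(variant)
    print(text)
    print(len(toy_embed(text)))
```

Swapping `toy_embed` for a call to an actual embedding model would leave the rest of the sketch unchanged, since only a fixed-length vector per description is assumed.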
Related papers
- AgriVariant: Variant Effect Prediction using DeepChem-Variant for Precision Breeding in Rice [0.0]
AgriVariant is an end-to-end pipeline for variant-effect prediction in rice (Oryza sativa). Our approach integrates deep learning-based variant calling (DeepChem-Variant) with custom plant genomics annotation. We validate the pipeline through targeted mutations in stress-response genes.
arXiv Detail & Related papers (2026-02-19T14:03:37Z) - Beyond GeneGPT: A Multi-Agent Architecture with Open-Source LLMs for Enhanced Genomic Question Answering [29.961363790887003]
We reproduce GeneGPT in a pilot study using open-source models, including Llama 3.1, Qwen2.5, and Qwen2.5 Coder, within a monolithic architecture. We then develop OpenBioLLM, a modular multi-agent framework that extends GeneGPT by introducing agent specialization for tool routing, query generation, and response validation. OpenBioLLM matches or outperforms GeneGPT on over 90% of the benchmark tasks, achieving average scores of 0.849 on Gene-Turing and 0.830 on GeneHop.
arXiv Detail & Related papers (2025-11-19T03:08:20Z) - EnTao-GPM: DNA Foundation Model for Predicting the Germline Pathogenic Mutations [16.32431932781823]
Cross-species targeted pre-training on disease-relevant mammalian genomes (human, pig, mouse). Germline mutation specialization via fine-tuning on ClinVar and HGMD. Interpretable clinical framework integrating DNA sequence embeddings with LLM-based statistical explanations.
arXiv Detail & Related papers (2025-07-29T11:34:41Z) - GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis [12.311957227670598]
GenoMAS orchestrates six specialized agents through typed message-passing protocols. At the heart of GenoMAS lies a guided-planning framework. GenoMAS surfaces biologically plausible gene-phenotype associations corroborated by the literature.
arXiv Detail & Related papers (2025-07-28T17:55:08Z) - Enhancing Omics Cohort Discovery for Research on Neurodegeneration through Ontology-Augmented Embedding Models [0.14999444543328289]
NeuroEmbed is an approach for the engineering of semantically accurate embedding spaces to represent cohorts and samples. The NeuroEmbed method comprises four stages: (1) extraction of cohorts from public repositories; (2) semi-automated normalization and augmentation of metadata of cohorts and samples using biomedical clustering and clustering on the embedding space; (3) automated generation of a natural language question-answering dataset for cohorts and samples based on randomized combinations of standardized metadata dimensions; and (4) fine-tuning of a domain-specific embedder to optimize queries.
arXiv Detail & Related papers (2025-06-16T13:27:10Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models [35.084222907099644]
We develop FREEFORM, Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling. FREEFORM is available as an open-source framework on GitHub: https://github.com/PennShenLab/FREEFORM.
arXiv Detail & Related papers (2024-10-02T17:53:08Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z) - Unsupervised ensemble-based phenotyping helps enhance the discoverability of genes related to heart morphology [57.25098075813054]
We propose a new framework for gene discovery entitled Unsupervised Phenotype Ensembles.
It builds a redundant yet highly expressive representation by pooling a set of phenotypes learned in an unsupervised manner.
These phenotypes are then analyzed via genome-wide association studies (GWAS), retaining only highly confident and stable associations.
arXiv Detail & Related papers (2023-01-07T18:36:44Z) - Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z) - Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.