BioReason: Incentivizing Multimodal Biological Reasoning within a DNA-LLM Model
- URL: http://arxiv.org/abs/2505.23579v2
- Date: Fri, 24 Oct 2025 17:16:49 GMT
- Authors: Adibvafa Fallahpour, Andrew Magnuson, Purav Gupta, Shihao Ma, Jack Naimer, Arnav Shah, Haonan Duan, Omar Ibrahim, Hani Goodarzi, Chris J. Maddison, Bo Wang
- Abstract summary: BioReason learns to produce logical, biologically coherent deductions. It boosts KEGG-based disease pathway prediction accuracy from 86% to 98%. It also improves variant effect prediction by an average of 15% over strong baselines.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Unlocking deep and interpretable biological reasoning from complex genomic data remains a major AI challenge limiting scientific progress. While current DNA foundation models excel at representing sequences, they struggle with multi-step reasoning and lack transparent, biologically meaningful explanations. BioReason addresses this by tightly integrating a DNA foundation model with a large language model (LLM), enabling the LLM to directly interpret and reason over genomic information. Through supervised fine-tuning and reinforcement learning, BioReason learns to produce logical, biologically coherent deductions. It achieves major performance gains, boosting KEGG-based disease pathway prediction accuracy from 86% to 98% and improving variant effect prediction by an average of 15% over strong baselines. BioReason can reason over unseen biological entities and explain its decisions step by step, offering a transformative framework for interpretable, mechanistic AI in biology. All data, code, and checkpoints are available at https://github.com/bowang-lab/BioReason
Related papers
- MEDNA-DFM: A Dual-View FiLM-MoE Model for Explainable DNA Methylation Prediction [7.3621714430935805]
We introduce a high-performance model MEDNA-DFM, alongside mechanism-inspired signal purification algorithms. Our investigation demonstrates that MEDNA-DFM effectively captures conserved methylation patterns. Applying our developed algorithms extracted motifs with significantly higher reliability than prior studies.
arXiv Detail & Related papers (2026-02-26T10:38:41Z)
- Progressive Multi-Agent Reasoning for Biological Perturbation Prediction [32.71169480836875]
We present LINCSQA, a novel benchmark for predicting target gene regulation under complex chemical perturbations. We also propose PBio-Agent, a multi-agent framework that integrates difficulty-aware task sequencing with iterative knowledge refinement. Our key insight is that genes affected by the same perturbation share causal structure, allowing confidently predicted genes to contextualize more challenging cases.
arXiv Detail & Related papers (2026-02-07T06:59:44Z)
- BABE: Biology Arena BEnchmark [51.53220868983288]
BABE is a benchmark designed to evaluate the experimental reasoning capabilities of biological AI systems. Our benchmark provides a robust framework for assessing how well AI systems can reason like practicing scientists.
arXiv Detail & Related papers (2026-02-05T16:39:20Z)
- Knowledge-Augmented Long-CoT Generation for Complex Biomolecular Reasoning [51.673503054645415]
Biomolecular mechanisms require multi-step reasoning across molecular interactions, signaling cascades, and metabolic pathways. Existing approaches often exacerbate these issues: reasoning steps may deviate from biological facts or fail to capture long mechanistic dependencies. We propose a Knowledge-Augmented Long-CoT Reasoning framework that integrates LLMs with knowledge graph-based multi-hop reasoning chains.
arXiv Detail & Related papers (2025-11-11T09:26:32Z)
- BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning [49.487327661584686]
We introduce BioMaze, a dataset with 5.1K complex pathway problems from real research. Our evaluation of methods such as CoT and graph-augmented reasoning shows that LLMs struggle with pathway reasoning. To address this, we propose PathSeeker, an LLM agent that enhances reasoning through interactive subgraph-based navigation.
arXiv Detail & Related papers (2025-02-23T17:38:10Z)
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- Biology Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models [51.316001071698224]
We introduce Biology-Instructions, the first large-scale multi-omics biological sequences-related instruction-tuning dataset. This dataset can bridge the gap between large language models (LLMs) and complex biological sequences-related tasks. We also develop a strong baseline called ChatMultiOmics with a novel three-stage training pipeline.
arXiv Detail & Related papers (2024-12-26T12:12:23Z)
- Causal Representation Learning from Multimodal Biomedical Observations [57.00712157758845]
We develop flexible identification conditions for multimodal data and principled methods to facilitate the understanding of biomedical datasets. A key theoretical contribution is the structural sparsity of causal connections between modalities. Results on a real-world human phenotype dataset are consistent with established biomedical research.
arXiv Detail & Related papers (2024-11-10T16:40:27Z)
- A Review of BioTree Construction in the Context of Information Fusion: Priors, Methods, Applications and Trends [41.740569399988644]
Biological tree (BioTree) analysis is a foundational tool in biology, enabling the exploration of evolutionary and differentiation. Traditional tree construction methods face challenges in handling the growing complexity and scale of modern biological data. Advances in deep learning (DL) offer transformative opportunities by enabling the fusion of biological prior knowledge with data-driven models.
arXiv Detail & Related papers (2024-10-07T08:00:41Z)
- ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab [67.24684071577211]
The challenge of replicating research results has posed a significant impediment to the field of molecular biology.
We first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective.
Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings.
arXiv Detail & Related papers (2023-11-01T14:44:01Z)
- Causal machine learning for single-cell genomics [94.28105176231739]
We discuss the application of machine learning techniques to single-cell genomics and their challenges.
We first present the model that underlies most current causal approaches to single-cell biology.
We then identify open problems in the application of causal approaches to single-cell data.
arXiv Detail & Related papers (2023-10-23T13:35:24Z)
- Deep Learning in Computational Biology: Advancements, Challenges, and Future Outlook [0.0]
We examine the history, advantages, and challenges of deep learning in computational biology.
Our focus is on two primary applications: DNA sequence classification and prediction, as well as protein structure prediction from sequence data.
arXiv Detail & Related papers (2023-10-02T07:53:05Z)