Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models
- URL: http://arxiv.org/abs/2412.19191v2
- Date: Tue, 23 Sep 2025 12:55:03 GMT
- Title: Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models
- Authors: Haonan He, Yuchen Ren, Yining Tang, Ziyang Xu, Junxian Li, Minghao Yang, Di Zhang, Dong Yuan, Tao Chen, Shufei Zhang, Yuqiang Li, Nanqing Dong, Wanli Ouyang, Dongzhan Zhou, Peng Ye
- Abstract summary: We introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences. This dataset bridges large language models (LLMs) and complex biological sequence-related tasks, enhancing their versatility and reasoning. We also highlight significant limitations of current state-of-the-art LLMs on multi-omics tasks without specialized training.
- Score: 55.74944165932666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have shown remarkable capabilities in general domains, but their application to multi-omics biology remains underexplored. To address this gap, we introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences, including DNA, RNA, proteins, and multi-molecules. This dataset bridges LLMs and complex biological sequence-related tasks, enhancing their versatility and reasoning while maintaining conversational fluency. We also highlight significant limitations of current state-of-the-art LLMs on multi-omics tasks without specialized training. To overcome this, we propose ChatMultiOmics, a strong baseline with a novel three-stage training pipeline, demonstrating superior biological understanding through Biology-Instructions. Both resources are publicly available, paving the way for better integration of LLMs in multi-omics analysis; Biology-Instructions can be found at: https://github.com/hhnqqq/Biology-Instructions.
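To make the dataset's role concrete, here is a minimal sketch of what a single instruction-tuning record for a DNA task could look like and how it might be rendered into a supervised training prompt. The field names, the example task, and the prompt template are illustrative assumptions, not the actual Biology-Instructions schema.

```python
# Hypothetical instruction-tuning record for a DNA task; the schema is
# an assumption for illustration -- see the Biology-Instructions repo
# for the real format.
record = {
    "omics": "DNA",
    "task": "promoter_detection",
    "instruction": "Determine whether the following DNA sequence "
                   "contains a promoter region.",
    "input": "TATAAAAGGCGCGTACGATCGATCGGCTA",
    "output": "Yes. The sequence contains a TATA-box-like motif, "
              "which is consistent with a promoter region.",
}

def to_training_prompt(rec: dict) -> str:
    """Render one record as a single supervised fine-tuning example."""
    return (
        f"### Instruction:\n{rec['instruction']}\n\n"
        f"### Sequence ({rec['omics']}):\n{rec['input']}\n\n"
        f"### Response:\n{rec['output']}"
    )

print(to_training_prompt(record))
```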
Related papers
- BioVERSE: Representation Alignment of Biomedical Modalities to LLMs for Multi-Modal Reasoning [0.36855563110245826]
We present BIOVERSE, a two-stage approach that adapts pretrained BioFMs as modality encoders. The approach first aligns each modality to a shared LLM space. It then applies standard instruction tuning with multi-modal data to bring them together for downstream reasoning.
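As a rough illustration of stage one, the sketch below maps embeddings from a frozen biological foundation model into an LLM's token-embedding space with a learned projection, in the spirit of common modality-alignment recipes. The single-linear-layer design and all dimensions are assumptions, not BIOVERSE's actual architecture.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Project a biological encoder's embeddings into the LLM space.

    A single linear layer and these dimensions are illustrative
    assumptions; BIOVERSE's actual alignment module may differ.
    """

    def __init__(self, bio_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(bio_dim, llm_dim)

    def forward(self, bio_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, bio_dim) -> (batch, seq_len, llm_dim); the
        # projected vectors can then be spliced into the LLM's input.
        return self.proj(bio_embeddings)

# Toy usage with random stand-in embeddings from a hypothetical encoder.
protein_embeddings = torch.randn(2, 16, 1280)
aligned = ModalityProjector()(protein_embeddings)
print(aligned.shape)  # torch.Size([2, 16, 4096])
```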
arXiv Detail & Related papers (2025-10-01T20:07:36Z)
- Large Language Models in Bioinformatics: A Survey [13.722344139230827]
Large Language Models (LLMs) are revolutionizing bioinformatics, enabling advanced analysis of DNA, RNA, proteins, and single-cell data.
This survey provides a systematic review of recent advancements, focusing on genomic sequence modeling, RNA structure prediction, protein function inference, and single-cell transcriptomics.
arXiv Detail & Related papers (2025-03-06T14:38:20Z)
- Biological Sequence with Language Model Prompting: A Survey [14.270959261105968]
Large language models (LLMs) have emerged as powerful tools for addressing challenges across diverse domains.
This paper systematically investigates the application of prompt-based methods with LLMs to biological sequences.
arXiv Detail & Related papers (2025-03-06T06:28:36Z)
- BioMaze: Benchmarking and Enhancing Large Language Models for Biological Pathway Reasoning [49.487327661584686]
We introduce BioMaze, a dataset with 5.1K complex pathway problems from real research.
Our evaluation of methods such as chain-of-thought (CoT) and graph-augmented reasoning shows that LLMs struggle with pathway reasoning.
To address this, we propose PathSeeker, an LLM agent that enhances reasoning through interactive subgraph-based navigation.
arXiv Detail & Related papers (2025-02-23T17:38:10Z)
- Life-Code: Central Dogma Modeling with Multi-Omics Sequence Unification [53.488387420073536]
Life-Code is a comprehensive framework that spans different biological functions.
Life-Code achieves state-of-the-art performance on various tasks across three omics.
arXiv Detail & Related papers (2025-02-11T06:53:59Z)
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.
It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- COMET: Benchmark for Comprehensive Biological Multi-omics Evaluation Tasks and Language Models [56.81513758682858]
COMET aims to evaluate models across single-omics, cross-omics, and multi-omics tasks. First, we curate and develop a diverse collection of downstream tasks and datasets covering key structural and functional aspects in DNA, RNA, and proteins. Then, we evaluate existing foundational language models for DNA, RNA, and proteins, as well as the newly proposed multi-omics method.
arXiv Detail & Related papers (2024-12-13T18:42:00Z)
- MAMMAL -- Molecular Aligned Multi-Modal Architecture and Language [0.4631438140637248]
MAMMAL is a versatile method for creating a multi-task foundation model that learns from large-scale biological datasets across diverse modalities. Evaluated on eleven diverse downstream tasks, it reaches a new state of the art (SOTA) on nine tasks and is comparable to SOTA on two. The authors also explored AlphaFold 3's binding prediction capabilities on antibody-antigen and nanobody-antigen complexes, showing significantly better classification performance for MAMMAL on 3 out of 4 targets.
arXiv Detail & Related papers (2024-10-28T20:45:52Z)
- BSM: Small but Powerful Biological Sequence Model for Genes and Proteins [6.6055625629542085]
We introduce BSM, a small but powerful mixed-modal biological sequence foundation model.
It is trained on three types of data: RefSeq, Gene Related Sequences, and interleaved biological sequences from the web.
It significantly enhances learning efficiency and cross-modal representation, outperforming models trained solely on unimodal data.
arXiv Detail & Related papers (2024-10-15T11:12:28Z)
- Multimodal Large Language Models for Bioimage Analysis [39.120941702559726]
Multimodal Large Language Models (MLLMs) exhibit strong emergent capacities, such as understanding, analyzing, reasoning, and generalization.
With these capabilities, MLLMs hold promise to extract intricate information from biological images and data obtained through various modalities.
MLLMs show increasing promise as intelligent assistants or agents that augment human researchers in biological research.
arXiv Detail & Related papers (2024-07-29T08:21:25Z)
- Genomic Language Models: Opportunities and Challenges [0.2912705470788796]
Genomic Language Models (gLMs) have the potential to significantly advance our understanding of genomes.
We highlight key applications of gLMs, including functional constraint prediction, sequence design, and transfer learning.
We discuss major considerations for developing and evaluating gLMs.
arXiv Detail & Related papers (2024-07-16T06:57:35Z)
- An Evaluation of Large Language Models in Bioinformatics Research [52.100233156012756]
We study the performance of large language models (LLMs) on a wide spectrum of crucial bioinformatics tasks.
These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems.
Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks.
arXiv Detail & Related papers (2024-02-21T11:27:31Z)
- When large language models meet evolutionary algorithms [48.213640761641926]
Pre-trained large language models (LLMs) have powerful capabilities for generating creative natural text.
Evolutionary algorithms (EAs) can discover diverse solutions to complex real-world problems.
Motivated by the shared collective and directional nature of text generation and evolution, this paper illustrates the parallels between LLMs and EAs.
arXiv Detail & Related papers (2024-01-19T05:58:30Z)
- Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping computing-power requirements low.
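For a sense of what such a lightweight adapter looks like, here is a generic bottleneck-adapter sketch of the kind commonly inserted into frozen pre-trained language models; the residual bottleneck design and all dimensions are standard assumptions and may not match this paper's exact adapter configuration.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic bottleneck adapter for knowledge injection.

    Only the small down/up projections are trained (e.g., on knowledge
    graph data) while the backbone PLM stays frozen. Dimensions are
    illustrative assumptions.
    """

    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the frozen backbone's features.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = BottleneckAdapter()
states = torch.randn(2, 128, 768)  # (batch, tokens, hidden)
print(adapter(states).shape)       # torch.Size([2, 128, 768])
```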
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
- ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab [67.24684071577211]
The challenge of replicating research results has posed a significant impediment to the field of molecular biology.
We first curate a comprehensive multimodal dataset, named ProBio, as an initial step towards this objective.
Next, we devise two challenging benchmarks, transparent solution tracking and multimodal action recognition, to emphasize the unique characteristics and difficulties associated with activity understanding in BioLab settings.
arXiv Detail & Related papers (2023-11-01T14:44:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.