BioBridge: Bridging Proteins and Language for Enhanced Biological Reasoning with LLMs
- URL: http://arxiv.org/abs/2602.17680v1
- Date: Wed, 04 Feb 2026 13:24:49 GMT
- Title: BioBridge: Bridging Proteins and Language for Enhanced Biological Reasoning with LLMs
- Authors: Yujia Wang, Jihong Guan, Wengen Li, Shuigeng Zhou, Xuhong Wang
- Abstract summary: BioBridge is a domain-adaptive continual pretraining framework for protein understanding. It demonstrates performance comparable to that of mainstream PLMs on multiple protein benchmarks.
- Score: 40.50730320622891
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Existing Protein Language Models (PLMs) often suffer from limited adaptability to multiple tasks and exhibit poor generalization across diverse biological contexts. In contrast, general-purpose Large Language Models (LLMs) lack the capability to interpret protein sequences and fall short in domain-specific knowledge, limiting their capacity for effective biosemantic reasoning. To combine the advantages of both, we propose BioBridge, a domain-adaptive continual pretraining framework for protein understanding. This framework employs Domain-Incremental Continual Pre-training (DICP) to infuse protein domain knowledge and a general reasoning corpus into an LLM simultaneously, effectively mitigating catastrophic forgetting. Cross-modal alignment is achieved via a PLM-Projector-LLM pipeline, which maps protein sequence embeddings into the semantic space of the language model. Finally, end-to-end optimization is adopted to uniformly support various tasks, including protein property prediction and knowledge question answering. BioBridge demonstrates performance comparable to that of mainstream PLMs on multiple protein benchmarks, such as EC and BindingDB, and achieves results on par with LLMs on general understanding tasks such as MMLU and RACE. This showcases its advantage of combining domain-specific adaptability with general-purpose language competency.
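To make the cross-modal alignment step concrete, below is a minimal sketch of the PLM-Projector-LLM idea in PyTorch. The projector design, dimensions, and all names are assumptions for illustration, not BioBridge's reported implementation.

```python
# A minimal, hypothetical sketch of projecting protein embeddings into an
# LLM's semantic space. Dimensions and the two-layer MLP are assumptions.
import torch
import torch.nn as nn

class ProteinProjector(nn.Module):
    """Maps per-residue PLM embeddings into the LLM's hidden space."""
    def __init__(self, plm_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(plm_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, plm_embeddings: torch.Tensor) -> torch.Tensor:
        # plm_embeddings: (batch, residues, plm_dim), e.g. from a frozen PLM
        return self.net(plm_embeddings)  # (batch, residues, llm_dim)

projector = ProteinProjector()
protein = torch.randn(1, 120, 1280)   # stand-in per-residue embeddings
soft_tokens = projector(protein)      # now live in the LLM's semantic space
# During end-to-end optimization, these "soft tokens" would be spliced into
# the prompt's text-token embeddings and trained against the task loss.
print(soft_tokens.shape)              # torch.Size([1, 120, 4096])
```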
Related papers
- Self Distillation Fine-Tuning of Protein Language Models Improves Versatility in Protein Design [61.2846583160056]
Supervised fine-tuning (SFT) is a standard approach for adapting large language models to specialized domains, but it is harder to apply to protein language models, in part because high-quality annotated data are far more difficult to obtain for proteins than for natural language. We present a simple and general recipe for fast SFT of PLMs, designed to improve the fidelity, reliability, and novelty of generated protein sequences.
arXiv Detail & Related papers (2025-12-10T05:34:47Z)
- Protein as a Second Language for LLMs [50.34983283157322]
"Protein-as-Second-Language" framework reformulates amino-acid sequences as sentences in a novel symbolic language.<n>We curate a bilingual corpus of 79,926 protein-QA instances spanning attribute prediction, descriptive understanding, and extended reasoning.<n>Our method delivers consistent gains across diverse open-source LLMs and GPT-4, achieving up to 17.2% ROUGE-L improvement.
arXiv Detail & Related papers (2025-10-13T09:21:45Z)
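As a concrete illustration of the second-language idea above, here is a minimal sketch of how a protein record might be serialized into a bilingual QA instance. The schema, tags, and wording are assumptions rather than the paper's actual corpus format.

```python
# Hypothetical serialization of one protein-QA training instance, in the
# spirit of the "Protein-as-Second-Language" framing. Field names and the
# <protein> tag are assumptions, not the paper's format.
def make_protein_qa_instance(sequence: str, question: str, answer: str) -> dict:
    """Treat the amino-acid string as a sentence in a second language that
    the LLM reads alongside the English question."""
    return {
        "prompt": (
            "The following sentence is written in the protein language, "
            "where each letter is one amino-acid residue:\n"
            f"<protein> {sequence} </protein>\n"
            f"Question: {question}"
        ),
        "response": answer,
    }

instance = make_protein_qa_instance(
    sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",  # toy sequence
    question="Is this protein likely to be soluble when expressed in E. coli?",
    answer="Likely yes; the sequence lacks long hydrophobic stretches.",  # illustrative only
)
print(instance["prompt"])
```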
- BioVERSE: Representation Alignment of Biomedical Modalities to LLMs for Multi-Modal Reasoning [0.36855563110245826]
We present BIOVERSE, a two-stage approach that adapts pretrained BioFMs as modality encoders. The approach first aligns each modality to a shared LLM space. It then applies standard instruction tuning with multi-modal data to bring them together for downstream reasoning.
arXiv Detail & Related papers (2025-10-01T20:07:36Z)
- PLM-eXplain: Divide and Conquer the Protein Embedding Space [0.0]
We present an explainable adapter layer, PLM-eXplain (PLM-X). PLM-X bridges the gap between predictive power and interpretability by factoring PLM embeddings into two components: an interpretable subspace based on established biochemical features, and a residual subspace that preserves the model's predictive power. We demonstrate the effectiveness of our approach across three protein-level classification tasks.
arXiv Detail & Related papers (2025-04-09T10:46:24Z)
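The factorization described in this entry can be illustrated with a small linear-algebra sketch: fit a linear map from embeddings to known biochemical features, take the feature-aligned projection as the interpretable component, and keep the remainder as the residual. The data, names, and choice of a plain least-squares map are assumptions, not PLM-X's implementation.

```python
# Toy embedding factorization: interpretable (feature-aligned) + residual.
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(100, 32))  # stand-in PLM embeddings (proteins x dims)
F = rng.normal(size=(100, 5))   # stand-in biochemical features (e.g. hydrophobicity)

# Least-squares map from embedding space to feature space.
W, *_ = np.linalg.lstsq(E, F, rcond=None)   # shape (32, 5)

# Project each embedding onto the feature-aligned directions (span of W's
# columns); whatever is left over is the residual carrying the rest of the
# PLM's signal.
P = W @ np.linalg.pinv(W)                   # (32, 32) projector onto span(W)
E_interp = E @ P                            # interpretable component
E_resid = E - E_interp                      # residual component
print(np.allclose(E, E_interp + E_resid))   # True: decomposition is lossless
```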
- Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models [55.74944165932666]
We introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences. This dataset bridges large language models (LLMs) and complex biological sequence-related tasks, enhancing their versatility and reasoning. We also highlight significant limitations of current state-of-the-art LLMs on multi-omics tasks without specialized training.
arXiv Detail & Related papers (2024-12-26T12:12:23Z)
- Long-context Protein Language Modeling Using Bidirectional Mamba with Shared Projection Layers [76.95505296417866]
Self-supervised training of language models (LMs) has seen great success for protein sequences, both in learning meaningful representations and in generative drug design. However, most protein LMs are based on the Transformer architecture and are trained on individual proteins with short context lengths. In this work, we propose LC-PLM, based on an alternative protein LM architecture, BiMamba-S, built upon selective structured state-space models.
arXiv Detail & Related papers (2024-10-29T16:43:28Z)
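Below is a minimal sketch of the bidirectional, shared-projection idea suggested by this entry's title, with a toy exponential-moving-average scan standing in for a real selective SSM (Mamba) block. The block design and all names are assumptions, not the LC-PLM architecture.

```python
# Toy bidirectional state-space layer with projections shared across
# directions; a causal EMA scan stands in for a real selective SSM.
import torch
import torch.nn as nn

class ToyScan(nn.Module):
    """Causal per-channel exponential-moving-average scan (SSM stand-in)."""
    def __init__(self, dim: int):
        super().__init__()
        self.log_decay = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        decay = torch.sigmoid(self.log_decay)            # per-channel, in (0, 1)
        state = x.new_zeros(x.size(0), x.size(2))
        outs = []
        for t in range(x.size(1)):                       # left-to-right recurrence
            state = decay * state + (1.0 - decay) * x[:, t]
            outs.append(state)
        return torch.stack(outs, dim=1)

class BiScanSharedProj(nn.Module):
    """Runs one scan per direction but shares the in/out projections,
    mirroring the 'shared projection layers' idea in the title."""
    def __init__(self, dim: int):
        super().__init__()
        self.in_proj = nn.Linear(dim, dim)       # shared by both directions
        self.out_proj = nn.Linear(2 * dim, dim)  # shared fusion projection
        self.fwd, self.bwd = ToyScan(dim), ToyScan(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.in_proj(x)
        f = self.fwd(h)                   # left-to-right context
        b = self.bwd(h.flip(1)).flip(1)   # right-to-left context
        return self.out_proj(torch.cat([f, b], dim=-1))

x = torch.randn(2, 16, 64)                # (batch, residues, dim)
print(BiScanSharedProj(64)(x).shape)      # torch.Size([2, 16, 64])
```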
- Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding with LLMs [43.811432723460534]
We introduce the Structure-Enhanced Protein Instruction Tuning (SEPIT) framework to bridge this gap. Our approach incorporates a novel structure-aware module into pLMs to enrich their structural knowledge. We construct the largest and most comprehensive protein instruction dataset to date, which allows us to train and evaluate a general-purpose protein understanding model.
arXiv Detail & Related papers (2024-10-04T16:02:50Z)
- ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction [54.132290875513405]
The prediction of protein-protein interactions (PPIs) is crucial for understanding biological functions and diseases.
Previous machine learning approaches to PPI prediction mainly focus on direct physical interactions.
We propose ProLLM, a novel framework that, for the first time, employs an LLM tailored for PPI prediction.
arXiv Detail & Related papers (2024-03-30T05:32:42Z)
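To illustrate the protein chain-of-thought idea from this entry, here is a minimal sketch of a prompt builder that serializes known interaction chains as numbered reasoning steps before asking about a new pair. The template is an assumption, not ProLLM's actual prompt format.

```python
# Hypothetical chain-of-thought prompt construction for PPI prediction.
def build_ppi_cot_prompt(protein_a: str, protein_b: str,
                         known_chain: list) -> str:
    """Encode known pairwise interactions as numbered reasoning steps,
    then ask the LLM about the query pair."""
    steps = "\n".join(
        f"Step {i + 1}: protein {src} interacts with protein {dst}."
        for i, (src, dst) in enumerate(known_chain)
    )
    return (
        "You are reasoning about protein-protein interactions.\n"
        f"{steps}\n"
        f"Question: given the interaction chain above, do {protein_a} and "
        f"{protein_b} interact? Think step by step, then answer yes or no."
    )

print(build_ppi_cot_prompt("TP53", "MDM2", [("TP53", "MDM4"), ("MDM4", "MDM2")]))
```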
- Linguistically inspired roadmap for building biologically reliable protein language models [0.5412332666265471]
We argue that guidance drawn from linguistics can aid in building more interpretable protein LMs.
We provide a linguistics-based roadmap for protein LM pipeline choices with regard to training data, tokenization, token embedding, sequence embedding, and model interpretation.
arXiv Detail & Related papers (2022-07-03T08:42:44Z)