Investigating Data Pruning for Pretraining Biological Foundation Models at Scale
- URL: http://arxiv.org/abs/2512.12932v1
- Date: Mon, 15 Dec 2025 02:42:52 GMT
- Title: Investigating Data Pruning for Pretraining Biological Foundation Models at Scale
- Authors: Yifan Wu, Jiyue Jiang, Xichen Ye, Yiqi Wang, Chang Zhou, Yitao Xu, Jiayang Chen, He Hu, Weizhong Zhang, Cheng Jin, Jiao Yuan, Yu Li
- Abstract summary: We propose a post-hoc influence-guided data pruning framework tailored to biological domains. Our framework consistently outperforms random selection baselines under an extreme pruning rate of over 99 percent. These findings underscore the potential of influence-guided data pruning to substantially reduce the computational cost of BioFM pretraining.
- Score: 47.09153330837959
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Biological foundation models (BioFMs), pretrained on large-scale biological sequences, have recently shown strong potential in providing meaningful representations for diverse downstream bioinformatics tasks. However, such models often rely on millions to billions of training sequences and billions of parameters, resulting in prohibitive computational costs and significant barriers to reproducibility and accessibility, particularly for academic labs. To address these challenges, we investigate the feasibility of data pruning for BioFM pretraining and propose a post-hoc influence-guided data pruning framework tailored to biological domains. Our approach introduces a subset-based self-influence formulation that enables efficient estimation of sample importance at low computational cost, and builds upon it two simple yet effective selection strategies, namely Top-k Influence (Top I) and Coverage-Centric Influence (CCI). We empirically validate our method on two representative BioFMs, RNA-FM and ESM-C. For RNA, our framework consistently outperforms random selection baselines under an extreme pruning rate of over 99 percent, demonstrating its effectiveness. Furthermore, we show the generalizability of our framework on protein-related tasks using ESM-C. In particular, our coreset even outperforms random subsets that are ten times larger in both RNA and protein settings, revealing substantial redundancy in biological sequence datasets. These findings underscore the potential of influence-guided data pruning to substantially reduce the computational cost of BioFM pretraining, paving the way for more efficient, accessible, and sustainable biological AI research.
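The abstract describes scoring samples by self-influence and then selecting a coreset via Top-k Influence (Top I) or Coverage-Centric Influence (CCI). The paper's exact subset-based estimator is not given here, so the sketch below is only illustrative: it uses the common first-order proxy of squared per-sample gradient norms for self-influence, and a simple stratified-by-score scheme to stand in for coverage-centric selection. All function names and the binning scheme are assumptions, not the authors' implementation.

```python
import numpy as np

def self_influence_scores(grad_matrix):
    """Approximate per-sample self-influence as the squared norm of each
    sample's loss gradient (a first-order proxy; the paper's subset-based
    estimator may differ)."""
    return np.sum(grad_matrix ** 2, axis=1)

def top_k_influence(scores, k):
    """Top-k Influence (Top I): keep the k highest-scoring samples."""
    return np.argsort(scores)[::-1][:k]

def coverage_centric_influence(scores, k, n_bins=10):
    """Coverage-Centric Influence (CCI), sketched as stratified selection:
    spread the budget across influence-score bins so the coreset covers the
    full score spectrum rather than only the highest-influence samples."""
    order = np.argsort(scores)                 # ascending by influence
    bins = np.array_split(order, n_bins)       # equal-width rank strata
    per_bin = max(1, k // n_bins)
    picked = np.concatenate([b[-per_bin:] for b in bins if len(b)])
    return picked[:k]

# Toy example: 1000 samples with 16-dim per-sample gradients,
# pruned to a 1% coreset (mirroring the >99% pruning rate in the paper).
rng = np.random.default_rng(0)
grads = rng.normal(size=(1000, 16))
scores = self_influence_scores(grads)
coreset_topk = top_k_influence(scores, k=10)
coreset_cci = coverage_centric_influence(scores, k=10)
```

The toy gradients here stand in for what would, in practice, be gradients of the pretraining loss on a reference checkpoint; only the selection logic is the point of the sketch.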
Related papers
- Investigating the Impact of Histopathological Foundation Models on Regressive Prediction of Homologous Recombination Deficiency [52.50039435394964]
We systematically evaluate foundation models for regression-based tasks. We extract patch-level features from whole slide images (WSI) using five state-of-the-art foundation models. Models are trained to predict continuous HRD scores based on these extracted features across breast, endometrial, and lung cancer cohorts.
arXiv Detail & Related papers (2026-01-29T14:06:50Z)
- Harnessing Large Language Models for Biomedical Named Entity Recognition [4.376764535031509]
BioNER is a foundational task in medical informatics, crucial for downstream applications like drug discovery and clinical trial matching. We introduce BioSelectTune, a highly efficient, data-centric framework for fine-tuning general-domain Large Language Models. Our model, trained on only 50% of the curated positive data, surpasses the fully-trained baseline.
arXiv Detail & Related papers (2025-12-28T01:34:23Z)
- Rep3Net: An Approach Exploiting Multimodal Representation for Molecular Bioactivity Prediction [0.8049701904919515]
In early-stage drug discovery, bioactivity prediction of molecules against target proteins plays a crucial role. We propose Rep3Net, a unified deep learning architecture that incorporates not only descriptor data but also spatial and relational information. Our model, employing multimodal features, produces reliable bioactivity predictions on a Poly [ADP-ribose] polymerase 1 dataset.
arXiv Detail & Related papers (2025-11-29T15:39:48Z)
- BioBO: Biology-informed Bayesian Optimization for Perturbation Design [10.086893225706321]
We propose Biology-Informed Bayesian Optimization (BioBO) to enhance surrogate modeling and acquisition strategies. BioBO combines biologically grounded priors with acquisition functions in a principled framework that biases the search toward promising genes. We show that BioBO improves labeling efficiency by 25-40% and consistently outperforms conventional BO by identifying top-performing perturbations.
arXiv Detail & Related papers (2025-09-24T10:50:06Z)
- RanAT4BIE: Random Adversarial Training for Biomedical Information Extraction [3.350193187012561]
Random adversarial training (RAT) is a novel framework successfully applied to biomedical information extraction tasks. RAT integrates random sampling mechanisms with adversarial training principles, achieving enhanced model generalization and robustness. Results highlight RAT's potential as a transformative framework for biomedical natural language processing.
arXiv Detail & Related papers (2025-09-14T09:40:00Z)
- CellPainTR: Generalizable Representation Learning for Cross-Dataset Cell Painting Analysis [51.56484100374058]
We introduce CellPainTR, a Transformer-based architecture designed to learn foundational representations of cellular morphology. Our work represents a significant step towards creating truly foundational models for image-based profiling, enabling more reliable and scalable cross-study biological analysis.
arXiv Detail & Related papers (2025-09-02T03:30:07Z)
- METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring [13.988975730867107]
We pretrain a metagenomic foundation model, METAGENE-1, on a novel corpus of diverse metagenomic DNA and RNA sequences. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic sequencing methods. We show results of pretraining this model on our metagenomic dataset, providing details about our losses, system metrics, and training stability over the course of pretraining.
arXiv Detail & Related papers (2025-01-03T18:44:43Z)
- Augmenting Biomedical Named Entity Recognition with General-domain Resources [47.24727904076347]
Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. We propose GERBERA, a simple-yet-effective method that utilizes general-domain NER datasets for training. We systematically evaluated GERBERA on five datasets covering eight entity types, collectively consisting of 81,410 instances.
arXiv Detail & Related papers (2024-06-15T15:28:02Z)
- Progress and Opportunities of Foundation Models in Bioinformatics [77.74411726471439]
Foundation models (FMs) have ushered in a new era in computational biology, especially in the realm of deep learning.
Central to our focus is the application of FMs to specific biological problems, aiming to guide the research community in choosing appropriate FMs for their research needs.
The review analyses the challenges and limitations faced by FMs in biology, such as data noise, model explainability, and potential biases.
arXiv Detail & Related papers (2024-02-06T02:29:17Z)
- Improving Biomedical Entity Linking with Retrieval-enhanced Learning [53.24726622142558]
$k$NN-BioEL provides a BioEL model with the ability to reference similar instances from the entire training corpus as clues for prediction.
We show that $k$NN-BioEL outperforms state-of-the-art baselines on several datasets.
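The kNN-BioEL idea above, referencing similar training instances as clues for prediction, can be sketched with a plain cosine-similarity nearest-neighbor lookup. The function name, the toy embeddings, and the majority-vote rule below are illustrative assumptions; the paper's actual model combines retrieval with a learned BioEL scorer rather than a bare vote.

```python
import numpy as np

def knn_augmented_prediction(query_emb, train_embs, train_labels, k=3):
    """Retrieve the k training instances most similar to the query by
    cosine similarity and use their labels as clues (here, a simple
    majority vote over the retrieved neighbors)."""
    q = query_emb / np.linalg.norm(query_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = t @ q                                  # cosine similarities
    nn_idx = np.argsort(sims)[::-1][:k]           # k nearest neighbors
    votes = [train_labels[i] for i in nn_idx]
    return max(set(votes), key=votes.count)

# Toy corpus: two entity clusters in a 2-d embedding space.
train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = ["entityA", "entityA", "entityB", "entityB"]
pred = knn_augmented_prediction(np.array([0.95, 0.05]), train, labels, k=3)
```

In a real BioEL setting the embeddings would come from the linking model's encoder over the whole training corpus, typically served by an approximate-nearest-neighbor index rather than the dense matrix product used here.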
arXiv Detail & Related papers (2023-12-15T14:04:23Z)
- Deep Learning for Virtual Screening: Five Reasons to Use ROC Cost Functions [80.12620331438052]
Deep learning has become an important tool for rapidly screening billions of molecules in silico for potential hits containing desired chemical features.
Despite its importance, substantial challenges persist in training these models, such as severe class imbalance, high decision thresholds, and lack of ground truth labels in some datasets.
We argue in favor of directly optimizing the receiver operating characteristic (ROC) in such cases, due to its robustness to class imbalance.
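"Directly optimizing the ROC" is commonly done through a pairwise surrogate: penalize every (positive, negative) pair whose positive score fails to exceed the negative score by a margin. The hinge form below is one standard choice, not necessarily the exact cost function the paper advocates; its robustness to class imbalance comes from comparing only across classes rather than counting per-class errors.

```python
import numpy as np

def pairwise_auc_hinge_loss(scores_pos, scores_neg, margin=1.0):
    """Pairwise hinge surrogate for ROC AUC: mean hinge penalty over all
    (positive, negative) score pairs. Zero iff every positive outscores
    every negative by at least `margin`."""
    diffs = scores_pos[:, None] - scores_neg[None, :]   # all pairwise gaps
    return float(np.mean(np.maximum(0.0, margin - diffs)))

# Perfectly ranked scores incur no penalty; an inverted pair is penalized.
loss_sep = pairwise_auc_hinge_loss(np.array([3.0, 4.0]), np.array([0.0, 1.0]))
loss_bad = pairwise_auc_hinge_loss(np.array([0.0]), np.array([1.0]))
```

Because the loss averages over positive-negative pairs, a 1000:1 class imbalance changes only the number of pairs, not the relative weight of the minority class, which is the property the abstract appeals to.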
arXiv Detail & Related papers (2020-06-25T08:46:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.