Harnessing Large Language Models for Biomedical Named Entity Recognition
- URL: http://arxiv.org/abs/2512.22738v1
- Date: Sun, 28 Dec 2025 01:34:23 GMT
- Title: Harnessing Large Language Models for Biomedical Named Entity Recognition
- Authors: Jian Chen, Leilei Su, Cong Sun
- Abstract summary: BioNER is a foundational task in medical informatics, crucial for downstream applications like drug discovery and clinical trial matching. We introduce BioSelectTune, a highly efficient, data-centric framework for fine-tuning general-domain Large Language Models. Our model, trained on only 50% of the curated positive data, surpasses the fully-trained baseline.
- Score: 4.376764535031509
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background and Objective: Biomedical Named Entity Recognition (BioNER) is a foundational task in medical informatics, crucial for downstream applications like drug discovery and clinical trial matching. However, adapting general-domain Large Language Models (LLMs) to this task is often hampered by their lack of domain-specific knowledge and the performance degradation caused by low-quality training data. To address these challenges, we introduce BioSelectTune, a highly efficient, data-centric framework for fine-tuning LLMs that prioritizes data quality over quantity. Methods and Results: BioSelectTune reformulates BioNER as a structured JSON generation task and leverages our novel Hybrid Superfiltering strategy, a weak-to-strong data curation method that uses a homologous weak model to distill a compact, high-impact training dataset. Conclusions: Through extensive experiments, we demonstrate that BioSelectTune achieves state-of-the-art (SOTA) performance across multiple BioNER benchmarks. Notably, our model, trained on only 50% of the curated positive data, not only surpasses the fully-trained baseline but also outperforms powerful domain-specialized models like BioMedBERT.
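The paper does not include code, but the reformulation of BioNER as structured JSON generation can be illustrated with a minimal sketch. The instruction wording and the JSON schema below are hypothetical stand-ins, not the paper's actual format:

```python
import json

def make_training_example(sentence, entities):
    """Format one BioNER instance as an instruction-tuning pair.

    `entities` is a list of (mention, type) tuples. The instruction text
    and the {"entities": [...]} schema are illustrative assumptions.
    """
    instruction = (
        "Extract all biomedical named entities from the sentence "
        "and return them as JSON."
    )
    target = json.dumps({"entities": [
        {"mention": mention, "type": etype} for mention, etype in entities
    ]})
    return {"instruction": instruction, "input": sentence, "output": target}

example = make_training_example(
    "Aspirin reduces the risk of myocardial infarction.",
    [("Aspirin", "Chemical"), ("myocardial infarction", "Disease")],
)
print(example["output"])
```

Casting extraction as JSON generation lets a general-domain LLM be supervised with ordinary sequence-to-sequence fine-tuning, and the output can be validated by simply parsing it.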
Related papers
- Investigating Data Pruning for Pretraining Biological Foundation Models at Scale [47.09153330837959]
We propose a post-hoc influence-guided data pruning framework tailored to biological domains. Our framework consistently outperforms random selection baselines under an extreme pruning rate of over 99 percent. These findings underscore the potential of influence-guided data pruning to substantially reduce the computational cost of BioFM pretraining.
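The pruning loop described above can be sketched generically: score each example, then keep only the top fraction. The scoring function and the 1% keep rate below are illustrative stand-ins for the paper's post-hoc influence estimates:

```python
def prune_by_influence(examples, influence_fn, keep_fraction=0.01):
    """Keep only the highest-influence training examples.

    `influence_fn` maps an example to a scalar influence estimate;
    here it is an arbitrary callable, not the paper's actual method.
    """
    scored = sorted(examples, key=influence_fn, reverse=True)
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]

# Toy usage: treat the example's own value as its "influence" score.
data = list(range(1000))
kept = prune_by_influence(data, influence_fn=lambda x: x, keep_fraction=0.01)
print(len(kept), kept[0])  # keeps the top 1% by score
```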
arXiv Detail & Related papers (2025-12-15T02:42:52Z) - CellPainTR: Generalizable Representation Learning for Cross-Dataset Cell Painting Analysis [51.56484100374058]
We introduce CellPainTR, a Transformer-based architecture designed to learn foundational representations of cellular morphology. Our work represents a significant step towards creating truly foundational models for image-based profiling, enabling more reliable and scalable cross-study biological analysis.
arXiv Detail & Related papers (2025-09-02T03:30:07Z) - Augmenting Biomedical Named Entity Recognition with General-domain Resources [47.24727904076347]
Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. We propose GERBERA, a simple-yet-effective method that utilizes general-domain NER datasets for training. We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances.
arXiv Detail & Related papers (2024-06-15T15:28:02Z) - BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers [48.21255861863282]
BMRetriever is a series of dense retrievers for enhancing biomedical retrieval.
BMRetriever exhibits strong parameter efficiency, with the 410M variant outperforming baselines up to 11.7 times larger.
arXiv Detail & Related papers (2024-04-29T05:40:08Z) - Multi-level biomedical NER through multi-granularity embeddings and enhanced labeling [3.8599767910528917]
This paper proposes a hybrid approach that integrates the strengths of multiple models.
BERT provides contextualized word embeddings, a pre-trained multi-channel CNN captures character-level information, and a BiLSTM + CRF performs sequence labelling, modelling dependencies between the words in the text.
We evaluate our model on the benchmark i2b2/2010 dataset, achieving an F1-score of 90.11.
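The F1-score reported for NER systems like this one is conventionally computed over exact entity spans rather than individual tokens. A minimal, model-independent sketch of entity-level precision, recall, and F1:

```python
def entity_f1(gold, pred):
    """Entity-level F1 over exact (start, end, type) span matches."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # true positives: exact span-and-type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy spans: one of two predictions matches, one of two gold entities found.
gold = {(0, 7, "Chemical"), (30, 51, "Disease")}
pred = {(0, 7, "Chemical"), (10, 15, "Disease")}
print(round(entity_f1(gold, pred), 2))  # precision = recall = 0.5 -> F1 = 0.5
```

Exact-match scoring is strict: a prediction that overlaps a gold entity but misses a boundary token counts as both a false positive and a false negative.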
arXiv Detail & Related papers (2023-12-24T21:45:36Z) - Improving Biomedical Entity Linking with Retrieval-enhanced Learning [53.24726622142558]
$k$NN-BioEL provides a BioEL model with the ability to reference similar instances from the entire training corpus as clues for prediction.
We show that $k$NN-BioEL outperforms state-of-the-art baselines on several datasets.
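The retrieval step behind a $k$NN-augmented approach like this can be sketched as nearest-neighbor lookup over mention embeddings. The toy two-dimensional vectors and CHEBI-style labels below are illustrative placeholders, not the paper's actual representations:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_clues(query_vec, corpus, k=2):
    """Return labels of the k training instances most similar to the query.

    `corpus` is a list of (embedding, label) pairs; embeddings here are
    toy vectors standing in for learned mention representations.
    """
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    return [label for _, label in ranked[:k]]

corpus = [
    ([1.0, 0.0], "aspirin -> CHEBI:15365"),
    ([0.9, 0.1], "acetylsalicylic acid -> CHEBI:15365"),
    ([0.0, 1.0], "ibuprofen -> CHEBI:5855"),
]
print(knn_clues([1.0, 0.05], corpus, k=2))
```

The retrieved neighbors then serve as in-context clues for the final linking decision, which is where such methods tend to gain over purely parametric models.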
arXiv Detail & Related papers (2023-12-15T14:04:23Z) - BioREx: Improving Biomedical Relation Extraction by Leveraging Heterogeneous Datasets [7.7587371896752595]
Biomedical relation extraction (RE) is a central task in biomedical natural language processing (NLP) research.
We present a novel framework for systematically addressing the data heterogeneity of individual datasets and combining them into a large dataset.
Our evaluation shows that BioREx achieves significantly higher performance than the benchmark system trained on the individual dataset.
arXiv Detail & Related papers (2023-06-19T22:48:18Z) - BioADAPT-MRC: Adversarial Learning-based Domain Adaptation Improves Biomedical Machine Reading Comprehension Task [4.837365865245979]
We present an adversarial learning-based domain adaptation framework for the biomedical machine reading comprehension task.
BioADAPT-MRC is a neural network-based method to address the discrepancies in the marginal distributions between the general and biomedical domain datasets.
arXiv Detail & Related papers (2022-02-26T16:14:27Z) - Fine-Tuning Large Neural Language Models for Biomedical Natural Language Processing [55.52858954615655]
We conduct a systematic study on fine-tuning stability in biomedical NLP.
We show that fine-tuning performance may be sensitive to pretraining settings, especially in low-resource domains.
We show that the proposed stabilization techniques can substantially improve fine-tuning performance for low-resource biomedical NLP applications.
arXiv Detail & Related papers (2021-12-15T04:20:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.