BIBERT-Pipe on Biomedical Nested Named Entity Linking at BioASQ 2025
- URL: http://arxiv.org/abs/2509.09725v1
- Date: Wed, 10 Sep 2025 09:14:25 GMT
- Title: BIBERT-Pipe on Biomedical Nested Named Entity Linking at BioASQ 2025
- Authors: Chunyu Li, Xindi Zheng, Siqi Liu,
- Abstract summary: We present our system for the BioNNE 2025 Multilingual Biomedical Nested Named Entity Linking shared task (English & Russian)<n>We leverage the same base encoder model in both stages: the retrieval stage uses the original pre-trained model, while the ranking stage applies domain-specific fine-tuning.<n>On the BioNNE 2025 leaderboard, our two stage system, bilingual bert (BIBERT-Pipe), ranks third in the multilingual track, demonstrating the effectiveness and competitiveness of these minimal yet principled modifications.
- Score: 5.329747408496098
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Entity linking (EL) for biomedical text is typically benchmarked on English-only corpora with flat mentions, leaving the more realistic scenario of nested and multilingual mentions largely unexplored. We present our system for the BioNNE 2025 Multilingual Biomedical Nested Named Entity Linking shared task (English & Russian), closing this gap with a lightweight pipeline that keeps the original EL model intact and modifies only three task-aligned components: Two-stage retrieval-ranking. We leverage the same base encoder model in both stages: the retrieval stage uses the original pre-trained model, while the ranking stage applies domain-specific fine-tuning. Boundary cues. In the ranking stage, we wrap each mention with learnable [Ms] / [Me] tags, providing the encoder with an explicit, language-agnostic span before robustness to overlap and nesting. Dataset augmentation. We also automatically expand the ranking training corpus with three complementary data sources, enhancing coverage without extra manual annotation. On the BioNNE 2025 leaderboard, our two stage system, bilingual bert (BIBERT-Pipe), ranks third in the multilingual track, demonstrating the effectiveness and competitiveness of these minimal yet principled modifications. Code are publicly available at https://github.com/Kaggle-Competitions-Code/BioNNE-L.
Related papers
- A Unified Biomedical Named Entity Recognition Framework with Large Language Models [44.92744341698289]
We propose a unified Biomedical Named Entity Recognition (BioNER) framework based on Large Language Models (LLMs)<n>We first reformulate BioNER as a text generation task and design a symbolic tagging strategy to jointly handle both flat and nested entities.<n>We perform bilingual joint fine-tuning across multiple Chinese and English datasets.
arXiv Detail & Related papers (2025-10-10T01:33:54Z) - ClinLinker: Medical Entity Linking of Clinical Concept Mentions in Spanish [39.81302995670643]
This study presents ClinLinker, a novel approach employing a two-phase pipeline for medical entity linking.
It is based on a SapBERT-based bi-encoder and subsequent re-ranking with a cross-encoder, trained by following a contrastive-learning strategy to be tailored to medical concepts in Spanish.
arXiv Detail & Related papers (2024-04-09T15:04:27Z) - Advancing Multilingual Pre-training: TRIP Triangular Document-level
Pre-training for Multilingual Language Models [107.83158521848372]
We present textbfTriangular Document-level textbfPre-training (textbfTRIP), which is the first in the field to accelerate the conventional monolingual and bilingual objectives into a trilingual objective with a novel method called Grafting.
TRIP achieves several strong state-of-the-art (SOTA) scores on three multilingual document-level machine translation benchmarks and one cross-lingual abstractive summarization benchmark, including consistent improvements by up to 3.11 d-BLEU points and 8.9 ROUGE-L points.
arXiv Detail & Related papers (2022-12-15T12:14:25Z) - Building Chinese Biomedical Language Models via Multi-Level Text
Discrimination [24.992542216072152]
We introduce eHealth, a biomedical PLM in Chinese built with a new pre-training framework.
This new framework trains eHealth as a discriminator through both token-level and sequence-level discrimination.
EHealth can learn language semantics at both the token and sequence levels.
arXiv Detail & Related papers (2021-10-14T10:43:28Z) - Fast and Effective Biomedical Entity Linking Using a Dual Encoder [48.86736921025866]
We propose a BERT-based dual encoder model that resolves multiple mentions in a document in one shot.
We show that our proposed model is multiple times faster than existing BERT-based models while being competitive in accuracy for biomedical entity linking.
arXiv Detail & Related papers (2021-03-08T19:32:28Z) - Improved Biomedical Word Embeddings in the Transformer Era [2.978663539080876]
We learn word and concept embeddings by first using the skip-gram method and further fine-tuning them with correlational information.
We conduct evaluations of these tuned static embeddings using multiple datasets for word relatedness developed by previous efforts.
arXiv Detail & Related papers (2020-12-22T03:03:50Z) - Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR)
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z) - Domain-Specific Language Model Pretraining for Biomedical Natural
Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z) - MULTISEM at SemEval-2020 Task 3: Fine-tuning BERT for Lexical Meaning [6.167728295758172]
We present the MULTISEM systems submitted to SemEval 2020 Task 3: Graded Word Similarity in Context (GWSC)
We experiment with injecting semantic knowledge into pre-trained BERT models through fine-tuning on lexical semantic tasks related to GWSC.
We use existing semantically annotated datasets and propose to approximate similarity through automatically generated lexical substitutes in context.
arXiv Detail & Related papers (2020-07-24T09:50:26Z) - Students Need More Attention: BERT-based AttentionModel for Small Data
with Application to AutomaticPatient Message Triage [65.7062363323781]
We propose a novel framework based on BioBERT (Bidirectional Representations from Transformers forBiomedical TextMining)
We introduce Label Embeddings for Self-Attention in each layer of BERT, which we call LESA-BERT, and (ii) by distilling LESA-BERT to smaller variants, we aim to reduce overfitting and model size when working on small datasets.
As an application, our framework is utilized to build a model for patient portal message triage that classifies the urgency of a message into three categories: non-urgent, medium and urgent.
arXiv Detail & Related papers (2020-06-22T03:39:00Z) - CBAG: Conditional Biomedical Abstract Generation [1.2633386045916442]
We propose a transformer-based conditional language model with a shallow encoder "condition" stack, and a deep "language model" stack of multi-headed attention blocks.
We generate biomedical abstracts given only a proposed title, an intended publication year, and a set of keywords.
arXiv Detail & Related papers (2020-02-13T17:11:33Z) - Multilingual Denoising Pre-training for Neural Machine Translation [132.66750663226287]
mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora.
mBART is one of the first methods for pre-training a complete sequence-to-sequence model.
arXiv Detail & Related papers (2020-01-22T18:59:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.