MedPath: Multi-Domain Cross-Vocabulary Hierarchical Paths for Biomedical Entity Linking
- URL: http://arxiv.org/abs/2511.10887v1
- Date: Fri, 14 Nov 2025 01:49:24 GMT
- Title: MedPath: Multi-Domain Cross-Vocabulary Hierarchical Paths for Biomedical Entity Linking
- Authors: Nishant Mishra, Wilker Aziz, Iacer Calixto
- Abstract summary: We present MedPath, a large-scale and multi-domain biomedical Entity Linking dataset. All entities are 1) normalized using the latest version of the Unified Medical Language System (UMLS) and 2) augmented with mappings to 62 other biomedical vocabularies. MedPath enables new research frontiers in biomedical NLP, facilitating training and evaluation of semantically rich and interpretable EL systems.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Progress in biomedical Named Entity Recognition (NER) and Entity Linking (EL) is currently hindered by a fragmented data landscape, a lack of resources for building explainable models, and the limitations of semantically-blind evaluation metrics. To address these challenges, we present MedPath, a large-scale and multi-domain biomedical EL dataset that builds upon nine existing expert-annotated EL datasets. In MedPath, all entities are 1) normalized using the latest version of the Unified Medical Language System (UMLS), 2) augmented with mappings to 62 other biomedical vocabularies and, crucially, 3) enriched with full ontological paths -- i.e., from general to specific -- in up to 11 biomedical vocabularies. MedPath directly enables new research frontiers in biomedical NLP, facilitating training and evaluation of semantically rich and interpretable EL systems, and the development of the next generation of interoperable and explainable clinical NLP models.
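To make the "full ontological path" idea concrete, a record in the style the abstract describes might look like the following sketch. All field names and the path layout are illustrative assumptions, not MedPath's actual schema; the UMLS CUI, SNOMED CT, and ICD-10-CM codes shown are standard identifiers for myocardial infarction.

```python
# Hypothetical sketch of an entity record with cross-vocabulary mappings
# and a general-to-specific ontological path. Field names are assumptions.
record = {
    "mention": "heart attack",
    "umls_cui": "C0027051",  # UMLS concept for myocardial infarction
    "vocab_mappings": {
        "SNOMEDCT_US": "22298006",
        "ICD10CM": "I21",
    },
    # Full ontological path in one vocabulary, ordered general -> specific.
    "path_snomed": [
        "Clinical finding",
        "Disorder of cardiovascular system",
        "Heart disease",
        "Myocardial infarction",
    ],
}

def most_specific(rec):
    """Return the most specific concept on the record's path."""
    return rec["path_snomed"][-1]

print(most_specific(record))  # -> Myocardial infarction
```

A path like this is what lets an EL system be evaluated semantically: a prediction of "Heart disease" for this mention is partially right along the path, rather than simply wrong.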
Related papers
- Automated Hierarchical Graph Construction for Multi-source Electronic Health Records [17.122817545326928]
We propose MASH, a fully automated framework that aligns medical codes across institutions using neural optimal transport. MASH integrates information from pre-trained language models, co-occurrence patterns, textual descriptions, and supervised labels. It produces interpretable hierarchical graphs that facilitate the navigation and understanding of heterogeneous clinical data.
arXiv Detail & Related papers (2025-09-08T11:45:59Z) - Biomedical Literature Q&A System Using Retrieval-Augmented Generation (RAG) [0.0]
This work presents a Biomedical Literature Question Answering (Q&A) system based on a Retrieval-Augmented Generation architecture. The system integrates diverse sources, including PubMed articles, curated Q&A datasets, and medical encyclopedias. The system supports both general medical queries and domain-specific tasks, with a focused evaluation on breast cancer literature.
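The retrieval step of a RAG pipeline like the one described can be sketched with a toy bag-of-words retriever. The corpus and scoring here are illustrative assumptions only; a real system would retrieve over dense embeddings and pass the top passages to an LLM for answer generation.

```python
import math
from collections import Counter

# Toy document store standing in for PubMed articles / encyclopedia entries.
corpus = {
    "doc1": "tamoxifen therapy in breast cancer patients",
    "doc2": "retrieval augmented generation for question answering",
    "doc3": "pubmed articles on cardiovascular disease",
}

def bow(text):
    """Bag-of-words term counts for a text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    """Return the k documents most similar to the query."""
    q = bow(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, bow(corpus[d])), reverse=True)
    return ranked[:k]

print(retrieve("breast cancer therapy"))  # -> ['doc1']
```

The retrieved passages would then be concatenated into the generator's prompt, which is what grounds the system's answers in the underlying literature.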
arXiv Detail & Related papers (2025-09-05T21:29:52Z) - A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI [70.06771291117965]
We introduce Biomedica, an open-source dataset derived from the PubMed Central Open Access subset. Biomedica contains over 6 million scientific articles and 24 million image-text pairs. We provide scalable streaming and search APIs through a web server, facilitating seamless integration with AI systems.
arXiv Detail & Related papers (2025-03-26T05:56:46Z) - BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature [73.39593644054865]
BIOMEDICA is a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. BMCA-CLIP is a suite of CLIP-style models continuously pretrained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally.
arXiv Detail & Related papers (2025-01-13T09:58:03Z) - Multi-level biomedical NER through multi-granularity embeddings and enhanced labeling [3.8599767910528917]
This paper proposes a hybrid approach that integrates the strengths of multiple models.
BERT provides contextualized word embeddings, a pre-trained multi-channel CNN captures character-level information, and a BiLSTM + CRF performs sequence labelling, modelling dependencies between the words in the text.
We evaluate our model on the benchmark i2b2/2010 dataset, achieving an F1-score of 90.11.
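The reported F1-score combines precision and recall over predicted entities. A minimal sketch of entity-level F1 (with hypothetical gold and predicted spans, not the paper's actual evaluation script) could look like:

```python
def entity_f1(gold, pred):
    """Entity-level precision, recall, and F1 from collections of
    (start, end, label) spans. Exact-match scoring; a toy illustration."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy example: 2 of 3 predicted spans match the 3 gold spans.
gold = [(0, 2, "problem"), (5, 7, "treatment"), (9, 10, "test")]
pred = [(0, 2, "problem"), (5, 7, "treatment"), (11, 12, "test")]
p, r, f = entity_f1(gold, pred)
print(round(f, 3))  # -> 0.667
```

On benchmarks such as i2b2/2010, scoring is typically done at this entity level (exact span and label match) rather than per token.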
arXiv Detail & Related papers (2023-12-24T21:45:36Z) - Diversifying Knowledge Enhancement of Biomedical Language Models using Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping requirements in computing power low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z) - EBOCA: Evidences for BiOmedical Concepts Association Ontology [55.41644538483948]
This paper proposes EBOCA, an ontology that describes (i) biomedical domain concepts and associations between them, and (ii) evidences supporting these associations.
Test data from a subset of DISNET, together with automatic association extractions from texts, has been transformed to create a Knowledge Graph that can be used in real scenarios.
arXiv Detail & Related papers (2022-08-01T18:47:03Z) - Cross-Domain Data Integration for Named Entity Disambiguation in Biomedical Text [5.008513565240167]
We propose a cross-domain data integration method that transfers structural knowledge from a general text knowledge base to the medical domain.
We utilize our integration scheme to augment structural resources and generate a large biomedical NED dataset for pretraining.
Our pretrained model with injected structural knowledge achieves state-of-the-art performance on two benchmark medical NED datasets: MedMentions and BC5CDR.
arXiv Detail & Related papers (2021-10-15T17:38:16Z) - Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z) - UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus [73.86656026386038]
We introduce UmlsBERT, a contextual embedding model that integrates domain knowledge during the pre-training process.
By applying these two strategies, UmlsBERT can encode clinical domain knowledge into word embeddings and outperform existing domain-specific models.
arXiv Detail & Related papers (2020-10-20T15:56:31Z) - COMETA: A Corpus for Medical Entity Linking in the Social Media [27.13349965075764]
We introduce a new corpus called COMETA, consisting of 20k English biomedical entity mentions from Reddit, expert-annotated with links to SNOMED CT.
Our corpus satisfies a combination of desirable properties, from scale and coverage to diversity and quality.
We shed light on the ability of EL systems to perform complex inference on entities and concepts under 2 challenging evaluation scenarios.
arXiv Detail & Related papers (2020-10-07T09:16:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.