MA-COIR: Leveraging Semantic Search Index and Generative Models for Ontology-Driven Biomedical Concept Recognition
- URL: http://arxiv.org/abs/2505.12964v1
- Date: Mon, 19 May 2025 11:00:43 GMT
- Title: MA-COIR: Leveraging Semantic Search Index and Generative Models for Ontology-Driven Biomedical Concept Recognition
- Authors: Shanshan Liu, Noriki Nishida, Rumana Ferdous Munne, Narumi Tokunaga, Yuki Yamagata, Kouji Kozaki, Yuji Matsumoto
- Abstract summary: We introduce MA-COIR, a framework that reformulates concept recognition as an indexing-recognition task. By assigning semantic search indexes (ssIDs) to concepts, MA-COIR resolves ambiguities in ontology entries and enhances recognition efficiency. Our results highlight the effectiveness of MA-COIR in recognizing both explicit and implicit concepts without the need for mention-level annotations.
- Score: 8.635416307171035
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recognizing biomedical concepts in the text is vital for ontology refinement, knowledge graph construction, and concept relationship discovery. However, traditional concept recognition methods, relying on explicit mention identification, often fail to capture complex concepts not explicitly stated in the text. To overcome this limitation, we introduce MA-COIR, a framework that reformulates concept recognition as an indexing-recognition task. By assigning semantic search indexes (ssIDs) to concepts, MA-COIR resolves ambiguities in ontology entries and enhances recognition efficiency. Using a pretrained BART-based model fine-tuned on small datasets, our approach reduces computational requirements to facilitate adoption by domain experts. Furthermore, we incorporate large language models (LLMs)-generated queries and synthetic data to improve recognition in low-resource settings. Experimental results on three scenarios (CDR, HPO, and HOIP) highlight the effectiveness of MA-COIR in recognizing both explicit and implicit concepts without the need for mention-level annotations during inference, advancing ontology-driven concept recognition in biomedical domain applications. Our code and constructed data are available at https://github.com/sl-633/macoir-master.
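The indexing-recognition framing described above can be illustrated with a minimal sketch: each ontology entry receives a unique semantic search index (ssID), and recognition then amounts to producing ssIDs for a passage rather than matching surface mentions. The ontology entries, ssID scheme, and helper names below are illustrative stand-ins, not the paper's actual implementation.

```python
# Hedged sketch of the indexing step: assign a unique ssID to every
# ontology entry, so that two entries sharing a label still get
# distinct indexes, which is how indexing sidesteps label ambiguity.

def build_ssid_index(ontology):
    """Map each concept ID to a short ssID and keep the reverse lookup."""
    ssid_of, entry_of = {}, {}
    for i, (concept_id, label) in enumerate(sorted(ontology.items())):
        ssid = f"c{i}"
        ssid_of[concept_id] = ssid
        entry_of[ssid] = (concept_id, label)
    return ssid_of, entry_of

def decode(ssids, entry_of):
    """Turn ssIDs produced by a generative recognizer back into
    (concept ID, label) pairs, skipping any invalid outputs."""
    return [entry_of[s] for s in ssids if s in entry_of]

# Toy usage with two (hypothetical) HPO entries:
ontology = {"HP:0001250": "Seizure",
            "HP:0002069": "Bilateral tonic-clonic seizure"}
ssid_of, entry_of = build_ssid_index(ontology)
```

In the paper's setup, a fine-tuned BART-based model would generate the ssID strings directly from the input text; the decode step above then maps those strings back to ontology entries.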
Related papers
- OntologyRAG: Better and Faster Biomedical Code Mapping with Retrieval-Augmented Generation (RAG) Leveraging Ontology Knowledge Graphs and Large Language Models [1.2941187430993801]
We create OntologyRAG, a retrieval-augmented generation (RAG) method for in-context learning over ontology representations. Our solution grounds LLMs in knowledge graphs that encode ontologies and the mappings between them when processing questions. It does not require re-training LLMs, as ontology updates can be reflected by updating the knowledge graphs through a standard process.
arXiv Detail & Related papers (2025-02-26T09:56:10Z) - Causal Representation Learning from Multimodal Biomedical Observations [57.00712157758845]
We develop flexible identification conditions for multimodal data and principled methods to facilitate the understanding of biomedical datasets. A key theoretical contribution is the structural sparsity of causal connections between modalities. Results on a real-world human phenotype dataset are consistent with established biomedical research.
arXiv Detail & Related papers (2024-11-10T16:40:27Z) - On the Element-Wise Representation and Reasoning in Zero-Shot Image Recognition: A Systematic Survey [82.49623756124357]
Zero-shot image recognition (ZSIR) aims to recognize and reason in unseen domains by learning generalized knowledge from limited data. This paper thoroughly investigates recent advances in element-wise ZSIR and provides a basis for its future development.
arXiv Detail & Related papers (2024-08-09T05:49:21Z) - Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery [52.498055901649025]
Concept Bottleneck Models (CBMs) have been proposed to address the 'black-box' problem of deep neural networks.
We propose a novel CBM approach -- called Discover-then-Name-CBM (DN-CBM) -- that inverts the typical paradigm.
Our concept extraction strategy is efficient, since it is agnostic to the downstream task, and uses concepts already known to the model.
arXiv Detail & Related papers (2024-07-19T17:50:11Z) - Towards Ontology-Enhanced Representation Learning for Large Language Models [0.18416014644193066]
We propose a novel approach to improve an embedding-Large Language Model (embedding-LLM) of interest by infusing knowledge from a reference ontology.
The linguistic information (i.e. concept synonyms and descriptions) and structural information (i.e. is-a relations) are utilized to compile a comprehensive set of concept definitions.
These concept definitions are then employed to fine-tune the target embedding-LLM using a contrastive learning framework.
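The contrastive fine-tuning step described above pulls a concept's embedding toward its ontology-derived definition and pushes it away from the definitions of other concepts. A minimal stdlib-only sketch of an InfoNCE-style objective over a precomputed similarity matrix follows; the temperature value and matrix layout are illustrative assumptions, not the paper's exact formulation.

```python
import math

def info_nce(sim_matrix, temperature=0.1):
    """InfoNCE-style loss over in-batch pairs.

    sim_matrix[i][j] is the similarity of concept embedding i with
    definition embedding j; the diagonal holds the positive
    (concept, definition) pairs. Returns the mean loss over anchors.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])  # -log softmax at the positive
    return sum(losses) / len(losses)
```

When the positives are clearly separated from the negatives (a strong diagonal), the loss approaches zero; a uniform similarity matrix yields log(N), which is the behavior a contrastive fine-tuning loop drives down.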
arXiv Detail & Related papers (2024-05-30T23:01:10Z) - Interpretable Prognostics with Concept Bottleneck Models [5.939858158928473]
Concept Bottleneck Models (CBMs) are inherently interpretable neural network architectures based on concept explanations.
CBMs enable domain experts to intervene on the concept activations at test-time.
Our case studies demonstrate that the performance of CBMs can be on par or superior to black-box models.
arXiv Detail & Related papers (2024-05-27T18:15:40Z) - MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning [48.97640824497327]
We propose a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning.
Our model includes global contrastive learning with our designed divergence encoder, local token-knowledge-patch alignment contrastive learning, and knowledge-guided category-level contrastive learning with expert knowledge.
Notably, MLIP surpasses state-of-the-art methods even with limited annotated data, highlighting the potential of multimodal pre-training in advancing medical representation learning.
arXiv Detail & Related papers (2024-02-03T05:48:50Z) - Implicit Concept Removal of Diffusion Models [92.55152501707995]
Text-to-image (T2I) diffusion models often inadvertently generate unwanted concepts such as watermarks and unsafe images.
We present Geom-Erasing, a novel concept removal method based on geometric-driven control.
arXiv Detail & Related papers (2023-10-09T17:13:10Z) - Ontology Enrichment from Texts: A Biomedical Dataset for Concept Discovery and Placement [22.074094839360413]
Mentions of new concepts appear regularly in texts and require automated approaches to harvest and place them into Knowledge Bases.
Existing datasets suffer from several issues, among them (i) mostly assuming that a new concept is pre-discovered, which cannot support out-of-KB mention discovery.
We demonstrate how the dataset can be used to evaluate out-of-KB mention discovery and concept placement, including recent Large Language Model based methods.
arXiv Detail & Related papers (2023-06-26T13:54:47Z) - HiPrompt: Few-Shot Biomedical Knowledge Fusion via Hierarchy-Oriented Prompting [33.1455954220194]
HiPrompt is a supervision-efficient knowledge fusion framework.
It elicits the few-shot reasoning ability of large language models through hierarchy-oriented prompts.
Empirical results on the collected KG-Hi-BKF benchmark datasets demonstrate the effectiveness of HiPrompt.
arXiv Detail & Related papers (2023-04-12T16:54:26Z) - Semantic Search for Large Scale Clinical Ontologies [63.71950996116403]
We present a deep learning approach to build a search system for large clinical vocabularies.
We propose a Triplet-BERT model and a method that generates semantic training data for it.
The model is evaluated using five real benchmark data sets, and the results show that our approach performs well on both free-text-to-concept and concept-to-concept search over large concept vocabularies.
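The Triplet-BERT approach mentioned above trains with a triplet objective: a query embedding is pulled toward the embedding of its correct concept and pushed away from a wrong one. A minimal stdlib-only sketch of that objective follows; the Euclidean distance and the margin value are common defaults here, assumed for illustration rather than taken from the paper.

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin objective over embedding vectors (plain lists).

    Loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`; otherwise the gap is penalized.
    """
    def dist(a, b):
        # Euclidean distance between two equal-length vectors
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)
```

In a search system of this kind, the anchor would be a BERT embedding of the free-text query, the positive the embedding of the matching ontology concept, and the negative a non-matching concept sampled during training.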
arXiv Detail & Related papers (2022-01-01T05:15:42Z) - A Meta-embedding-based Ensemble Approach for ICD Coding Prediction [64.42386426730695]
International Classification of Diseases (ICD) codes are the de facto standard used globally for clinical coding.
These codes enable healthcare providers to claim reimbursement and facilitate efficient storage and retrieval of diagnostic information.
Our proposed approach enhances the performance of neural models by effectively training word vectors using routine medical data as well as external knowledge from scientific articles.
arXiv Detail & Related papers (2021-02-26T17:49:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.