Related papers: Enhancing Omics Cohort Discovery for Research on Neurodegeneration through Ontology-Augmented Embedding Models

Enhancing Omics Cohort Discovery for Research on Neurodegeneration through Ontology-Augmented Embedding Models

URL: http://arxiv.org/abs/2506.13467v1
Date: Mon, 16 Jun 2025 13:27:10 GMT
Title: Enhancing Omics Cohort Discovery for Research on Neurodegeneration through Ontology-Augmented Embedding Models
Authors: José A. Pardo, Alicia Gómez-Pascual, José T. Palma, Juan A. Botía,
Abstract summary: NeuroEmbed is an approach for the engineering of semantically accurate embedding spaces to represent cohorts and samples.<n>The NeuroEmbed method comprises four stages: (1) extraction of cohorts from public repositories; (2) semi-automated normalization and augmentation of metadata of cohorts and samples using biomedical clustering and clustering on the embedding space; (3) automated generation of a natural language question-answering dataset for cohorts and samples based on randomized combinations of standardized metadata dimensions; and (4) fine-tuning of a domain-specific embedder to optimize queries.
Score: 0.14999444543328289
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The growing volume of omics and clinical data generated for neurodegenerative diseases (NDs) requires new approaches for their curation so they can be ready-to-use in bioinformatics. NeuroEmbed is an approach for the engineering of semantically accurate embedding spaces to represent cohorts and samples. The NeuroEmbed method comprises four stages: (1) extraction of ND cohorts from public repositories; (2) semi-automated normalization and augmentation of metadata of cohorts and samples using biomedical ontologies and clustering on the embedding space; (3) automated generation of a natural language question-answering (QA) dataset for cohorts and samples based on randomized combinations of standardized metadata dimensions and (4) fine-tuning of a domain-specific embedder to optimize queries. We illustrate the approach using the GEO repository and the PubMedBERT pretrained embedder. Applying NeuroEmbed, we semantically indexed 2,801 repositories and 150,924 samples. Amongst many biology-relevant categories, we normalized more than 1,700 heterogeneous tissue labels from GEO into 326 unique ontology-aligned concepts and enriched annotations with new ontology-aligned terms, leading to a fold increase in size for the metadata terms between 2.7 and 20 fold. After fine-tuning PubMedBERT with the QA training data augmented with the enlarged metadata, the model increased its mean Retrieval Precision from 0.277 to 0.866 and its mean Percentile Rank from 0.355 to 0.896. The NeuroEmbed methodology for the creation of electronic catalogues of omics cohorts and samples will foster automated bioinformatic pipelines construction. The NeuroEmbed catalogue of cohorts and samples is available at https://github.com/JoseAdrian3/NeuroEmbed.

Related papers

NDAI-NeuroMAP: A Neuroscience-Specific Embedding Model for Domain-Specific Retrieval [1.5705429611931057]
NDAI-NeuroMAP is the first neuroscience-domain-specific dense vector embedding model engineered for high-precision information retrieval tasks.<n>We employ a sophisticated fine-tuning approach utilizing the FremyCompany/BioLORD-2023 foundation model.<n> Comprehensive evaluation on a held-out test dataset comprising approximately 24,000 neuroscience-specific queries demonstrates substantial performance improvements over state-of-the-art general-purpose embedding models.
arXiv Detail & Related papers (2025-07-04T06:28:53Z)
Private Training & Data Generation by Clustering Embeddings [74.00687214400021]
Differential privacy (DP) provides a robust framework for protecting individual data.<n>We introduce a novel principled method for DP synthetic image embedding generation.<n> Empirically, a simple two-layer neural network trained on synthetically generated embeddings achieves state-of-the-art (SOTA) classification accuracy.
arXiv Detail & Related papers (2025-06-20T00:17:14Z)
A Meta-GNN approach to personalized seizure detection and classification [53.906130332172324]
We propose a personalized seizure detection and classification framework that quickly adapts to a specific patient from limited seizure samples. We train a Meta-GNN based classifier that learns a global model from a set of training patients. We show that our method outperforms the baselines by reaching 82.7% on accuracy and 82.08% on F1 score after only 20 iterations on new unseen patients.
arXiv Detail & Related papers (2022-11-01T14:12:58Z)
Integrative Imaging Informatics for Cancer Research: Workflow Automation for Neuro-oncology (I3CR-WANO) [0.12175619840081271]
We propose an artificial intelligence-based solution for the aggregation and processing of multisequence neuro-Oncology MRI data. Our end-to-end framework i) classifies MRI sequences using an ensemble classifier, ii) preprocesses the data in a reproducible manner, and iv) delineates tumor tissue subtypes. It is robust to missing sequences and adopts an expert-in-the-loop approach, where the segmentation results may be manually refined by radiologists.
arXiv Detail & Related papers (2022-10-06T18:23:42Z)
Deeper Clinical Document Understanding Using Relation Extraction [0.0]
We propose a text mining framework comprising of Named Entity Recognition (NER) and Relation Extraction (RE) models. We introduce two new RE model architectures -- an accuracy-optimized one based on BioBERT and a speed-optimized one utilizing crafted features over a Fully Connected Neural Network (FCNN) We show two practical applications of this framework -- for building a biomedical knowledge graph and for improving the accuracy of mapping entities to clinical codes.
arXiv Detail & Related papers (2021-12-25T17:14:13Z)
Score-based Generative Modeling in Latent Space [93.8985523558869]
Score-based generative models (SGMs) have recently demonstrated impressive results in terms of both sample quality and distribution coverage. Here, we propose the Latent Score-based Generative Model (LSGM), a novel approach that trains SGMs in a latent space. Moving from data to latent space allows us to train more expressive generative models, apply SGMs to non-continuous data, and learn smoother SGMs in a smaller space.
arXiv Detail & Related papers (2021-06-10T17:26:35Z)
Data Augmentation in High Dimensional Low Sample Size Setting Using a Geometry-Based Variational Autoencoder [0.1529342790344802]
We propose a new method to perform data augmentation in the High Dimensional Low Sample Size (HDLSS) setting using a geometry-based variational autoencoder. Our approach combines a proper latent space modeling of the VAE seen as a Riemannian manifold with a new generation scheme which produces more meaningful samples.
arXiv Detail & Related papers (2021-04-30T18:10:33Z)
G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers. We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z)
Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients. We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks. Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
A Systematic Approach to Featurization for Cancer Drug Sensitivity Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques. We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.