Neural Distance Embeddings for Biological Sequences
- URL: http://arxiv.org/abs/2109.09740v1
- Date: Mon, 20 Sep 2021 17:30:58 GMT
- Title: Neural Distance Embeddings for Biological Sequences
- Authors: Gabriele Corso, Rex Ying, Michal Pándy, Petar Veličković, Jure Leskovec, Pietro Liò
- Abstract summary: We present NeuroSEED, a framework to embed sequences in geometric vector spaces.
We show the effectiveness of the hyperbolic space that captures the hierarchical structure and provides an average 22% reduction in embedding RMSE.
The proposed approaches display significant accuracy and/or runtime improvements on real-world datasets.
- Score: 43.07977514121458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The development of data-dependent heuristics and representations for
biological sequences that reflect their evolutionary distance is critical for
large-scale biological research. However, popular machine learning approaches,
based on continuous Euclidean spaces, have struggled with the discrete
combinatorial formulation of the edit distance that models evolution and the
hierarchical relationship that characterises real-world datasets. We present
Neural Distance Embeddings (NeuroSEED), a general framework to embed sequences
in geometric vector spaces, and illustrate the effectiveness of the hyperbolic
space that captures the hierarchical structure and provides an average 22%
reduction in embedding RMSE against the best competing geometry. The capacity
of the framework and the significance of these improvements are then
demonstrated by devising supervised and unsupervised NeuroSEED approaches to
multiple core tasks in bioinformatics. Benchmarked with common baselines, the
proposed approaches display significant accuracy and/or runtime improvements on
real-world datasets. As an example for hierarchical clustering, the proposed
pretrained and from-scratch methods match the quality of competing baselines
with 30x and 15x runtime reduction, respectively.
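The quantities named in the abstract can be made concrete with a small sketch. The code below is not the authors' implementation: it computes the edit distance with the textbook dynamic program, the geodesic distance in the Poincaré ball (one common model of hyperbolic space, assumed here for illustration), and an RMSE between the two over all sequence pairs, which is one natural reading of "embedding RMSE". The function names, the `scale` parameter, and the hand-picked embeddings are hypothetical; in practice a trained encoder would produce the embeddings.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic program, O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def poincare_distance(u, v) -> float:
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_diff = sum((x - y) ** 2 for x, y in zip(u, v))
    norm_u = sum(x * x for x in u)
    norm_v = sum(y * y for y in v)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - norm_u) * (1.0 - norm_v)))


def embedding_rmse(seqs, embeddings, scale: float = 1.0) -> float:
    """RMSE between true edit distances and scaled embedding distances, over all pairs."""
    sq_errors = []
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            target = edit_distance(seqs[i], seqs[j])
            predicted = scale * poincare_distance(embeddings[i], embeddings[j])
            sq_errors.append((target - predicted) ** 2)
    return math.sqrt(sum(sq_errors) / len(sq_errors))


if __name__ == "__main__":
    # Hypothetical toy data: three short sequences and hand-picked 2D embeddings.
    seqs = ["ACGT", "ACGA", "TTTT"]
    embeddings = [(0.10, 0.00), (0.15, 0.05), (-0.60, 0.30)]
    print(embedding_rmse(seqs, embeddings, scale=2.0))
```

Computing the distance between two embeddings takes time linear in the embedding dimension, whereas the exact edit distance is quadratic in sequence length, which is the kind of trade-off behind the runtime reductions the abstract reports.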
Related papers
- Enhanced High-Dimensional Data Visualization through Adaptive Multi-Scale Manifold Embedding [0.7705234721762716]
We propose an Adaptive Multi-Scale Manifold Embedding (AMSME) algorithm.
We introduce ordinal distance and demonstrate that it overcomes the constraints of the curse of dimensionality in high-dimensional spaces.
Experimental results demonstrate that AMSME significantly preserves intra-cluster topological structures and improves inter-cluster separation on real-world datasets.
arXiv Detail & Related papers (2025-03-18T06:46:53Z)
- RankByGene: Gene-Guided Histopathology Representation Learning Through Cross-Modal Ranking Consistency [11.813883157319381]
We propose a novel framework that aligns gene and image features using a ranking-based alignment loss.
To further enhance the alignment's stability, we employ self-supervised knowledge distillation with a teacher-student network architecture.
arXiv Detail & Related papers (2024-11-22T17:08:28Z)
- How to Bridge Spatial and Temporal Heterogeneity in Link Prediction? A Contrastive Method [11.719027225797037]
We propose a novel Contrastive Learning-based Link Prediction model, CLP.
Our model consistently outperforms the state-of-the-art models, demonstrating an average improvement of 10.10% and 13.44% in terms of AUC and AP, respectively.
arXiv Detail & Related papers (2024-11-01T14:20:53Z)
- PRAGA: Prototype-aware Graph Adaptive Aggregation for Spatial Multi-modal Omics Analysis [1.1619559582563954]
We propose a novel spatial multi-modal omics resolved framework, termed PRototype-Aware Graph Adaptive Aggregation for Spatial Multi-modal Omics Analysis (PRAGA).
PRAGA constructs a dynamic graph to capture latent semantic relations and comprehensively integrate spatial information and feature semantics.
The learnable graph structure can also denoise perturbations by learning cross-modal knowledge.
arXiv Detail & Related papers (2024-09-19T12:53:29Z)
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z)
- GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
- Injecting Hierarchical Biological Priors into Graph Neural Networks for Flow Cytometry Prediction [1.7709249262395883]
This work explores injecting hierarchical prior knowledge into graph neural networks (GNNs) for single-cell multi-class classification of cellular data.
We propose a hierarchical plug-in method, FCHC-GNN, that can be applied to several GNN models and is designed to capture neighborhood information crucial for the single-cell FC domain.
arXiv Detail & Related papers (2024-05-28T18:24:16Z)
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
- Interpolation-based Correlation Reduction Network for Semi-Supervised Graph Learning [49.94816548023729]
We propose a novel graph contrastive learning method, termed Interpolation-based Correlation Reduction Network (ICRN).
In our method, we improve the discriminative capability of the latent feature by enlarging the margin of decision boundaries.
By combining the two settings, we extract rich supervision information from both the abundant unlabeled nodes and the rare yet valuable labeled nodes for discriminative representation learning.
arXiv Detail & Related papers (2022-06-06T14:26:34Z)
- Learning Neural Causal Models with Active Interventions [83.44636110899742]
We introduce an active intervention-targeting mechanism which enables a quick identification of the underlying causal structure of the data-generating process.
Our method significantly reduces the required number of interactions compared with random intervention targeting.
We demonstrate superior performance on multiple benchmarks from simulated to real-world data.
arXiv Detail & Related papers (2021-09-06T13:10:37Z)
- UNIK: A Unified Framework for Real-world Skeleton-based Action Recognition [11.81043814295441]
We introduce UNIK, a novel skeleton-based action recognition method that is able to generalize across datasets.
To study the cross-domain generalizability of action recognition in real-world videos, we re-evaluate state-of-the-art approaches as well as the proposed UNIK.
Results show that the proposed UNIK, with pre-training on Posetics, generalizes well and outperforms state-of-the-art when transferred onto four target action classification datasets.
arXiv Detail & Related papers (2021-07-19T02:00:28Z)
- Unsupervised Domain Adaptation in Person re-ID via k-Reciprocal Clustering and Large-Scale Heterogeneous Environment Synthesis [76.46004354572956]
We introduce an unsupervised domain adaptation approach for person re-identification.
Experimental results show that the proposed ktCUDA and SHRED approach achieves an average improvement of +5.7 mAP in re-identification performance.
arXiv Detail & Related papers (2020-01-14T17:43:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.