$\Gamma$-VAE: Curvature regularized variational autoencoders for
uncovering emergent low dimensional geometric structure in high dimensional
data
- URL: http://arxiv.org/abs/2403.01078v1
- Date: Sat, 2 Mar 2024 03:26:09 GMT
- Title: $\Gamma$-VAE: Curvature regularized variational autoencoders for
uncovering emergent low dimensional geometric structure in high dimensional
data
- Authors: Jason Z. Kim, Nicolas Perrin-Gilbert, Erkan Narmanli, Paul Klein,
Christopher R. Myers, Itai Cohen, Joshua J. Waterfall, James P. Sethna
- Abstract summary: Natural systems with emergent behaviors often organize along low-dimensional subsets of high-dimensional spaces.
We show that regularizing the curvature of generative models will enable more consistent, predictive, and generalizable models.
- Score: 0.25128687379089687
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Natural systems with emergent behaviors often organize along low-dimensional
subsets of high-dimensional spaces. For example, despite the tens of thousands
of genes in the human genome, the principled study of genomics is fruitful
because biological processes rely on coordinated organization that results in
lower dimensional phenotypes. To uncover this organization, many nonlinear
dimensionality reduction techniques have successfully embedded high-dimensional
data into low-dimensional spaces by preserving local similarities between data
points. However, the nonlinearities in these methods allow for too much
curvature to preserve general trends across multiple non-neighboring data
clusters, thereby limiting their interpretability and generalizability to
out-of-distribution data. Here, we address both of these limitations by
regularizing the curvature of manifolds generated by variational autoencoders,
a process we coin ``$\Gamma$-VAE''. We demonstrate its utility using two
example data sets: bulk RNA-seq from The Cancer Genome Atlas (TCGA) and the
Genotype-Tissue Expression (GTEx) project; and single-cell RNA-seq from a lineage
tracing experiment in hematopoietic stem cell differentiation. We find that the
resulting regularized manifolds identify mesoscale structure associated with
different cancer cell types, and accurately re-embed tissues from completely
unseen, out-of-distribution cancers as if the model had been trained on them.
Finally, we show that preserving long-range relationships to differentiated
cells separates undifferentiated cells -- which have not yet specialized --
according to their eventual fate. Broadly, we anticipate that regularizing the
curvature of generative models will enable more consistent, predictive, and
generalizable models in any high-dimensional system with emergent
low-dimensional behavior.
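The abstract does not spell out the form of the curvature penalty. As a minimal sketch of the general idea (our illustration, not the authors' implementation), one can estimate the decoder's second directional derivatives in latent space by central finite differences and penalize their squared norm; the toy decoder, step size `eps`, and random direction sampling below are all assumptions:

```python
import numpy as np

def decoder(z):
    # Toy nonlinear decoder mapping 2-D latents to 5-D outputs
    # (a stand-in for a trained VAE decoder network).
    W = np.arange(10).reshape(5, 2) / 10.0
    return np.tanh(W @ z) + 0.1 * (W @ z) ** 2

def curvature_penalty(decode, z, n_dirs=8, eps=1e-2, seed=0):
    """Mean squared second directional derivative of `decode` at z,
    estimated by central finite differences along random unit directions."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_dirs):
        v = rng.standard_normal(z.shape)
        v /= np.linalg.norm(v)
        # (f(z + eps*v) - 2 f(z) + f(z - eps*v)) / eps**2
        d2 = (decode(z + eps * v) - 2.0 * decode(z) + decode(z - eps * v)) / eps**2
        total += float(np.sum(d2 ** 2))
    return total / n_dirs

z = np.array([0.3, -0.5])
penalty = curvature_penalty(decoder, z)  # positive for a curved decoder
# A linear decoder generates a flat manifold, so its penalty is ~0;
# the regularizer pushes the learned manifold toward (not to) flatness.
flat = curvature_penalty(lambda u: np.arange(10).reshape(5, 2) @ u, z)
```

In training, a weighted version of such a penalty would be added to the usual VAE evidence lower bound, with the weight trading reconstruction fidelity against manifold flatness.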
Related papers
- Generating Multi-Modal and Multi-Attribute Single-Cell Counts with CFGen [76.02070962797794]
We present Cell Flow for Generation, a flow-based conditional generative model for multi-modal single-cell counts.
Our results show improved recovery of key biological characteristics of the data while supporting novel generative tasks.
arXiv Detail & Related papers (2024-07-16T14:05:03Z)
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z)
- sc-OTGM: Single-Cell Perturbation Modeling by Solving Optimal Mass Transport on the Manifold of Gaussian Mixtures [0.9674145073701153]
sc-OTGM is an unsupervised model grounded in the inductive bias that scRNA-seq data can be generated from a Gaussian mixture distribution.
sc-OTGM is effective in cell state classification, aids in the analysis of differential gene expression, and ranks genes for target identification.
It also predicts the effects of single-gene perturbations on downstream gene regulation and generates synthetic scRNA-seq data conditioned on specific cell states.
arXiv Detail & Related papers (2024-05-06T06:46:11Z)
- Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets.
In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem.
This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z)
- Single-Cell Deep Clustering Method Assisted by Exogenous Gene Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells.
During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation.
This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z)
- StyleGenes: Discrete and Efficient Latent Distributions for GANs [149.0290830305808]
We propose a discrete latent distribution for Generative Adversarial Networks (GANs)
Instead of drawing latent vectors from a continuous prior, we sample from a finite set of learnable latents.
We take inspiration from the encoding of information in biological organisms.
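A rough sketch of the stated mechanism, sampling latents from a finite set of learnable codes rather than a continuous prior; the codebook size, dimensionality, and random initialization here are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical bank of discrete latent codes; in a GAN these entries
# would be learnable parameters trained jointly with the generator.
codebook = rng.standard_normal((256, 64))   # 256 codes, 64-D each

idx = rng.integers(0, 256, size=4)          # draw discrete indices, not Gaussians
z_batch = codebook[idx]                     # (4, 64) latent batch for the generator
```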
arXiv Detail & Related papers (2023-04-30T23:28:46Z)
- Learning Causal Representations of Single Cells via Sparse Mechanism Shift Modeling [3.2435888122704037]
We propose a deep generative model of single-cell gene expression data for which each perturbation is treated as an intervention targeting an unknown, but sparse, subset of latent variables.
We benchmark these methods on simulated single-cell data to evaluate their performance at latent units recovery, causal target identification and out-of-domain generalization.
arXiv Detail & Related papers (2022-11-07T15:47:40Z)
- Modelling Technical and Biological Effects in scRNA-seq data with Scalable GPLVMs [6.708052194104378]
We extend a popular approach for probabilistic non-linear dimensionality reduction, the Gaussian process latent variable model, to scale to massive single-cell datasets.
The key idea is to use an augmented kernel which preserves the factorisability of the lower bound allowing for fast variational inference.
arXiv Detail & Related papers (2022-09-14T15:25:15Z)
- Intrinsic dimension estimation for discrete metrics [65.5438227932088]
In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces.
We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting.
This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of sequence space.
arXiv Detail & Related papers (2022-07-20T06:38:36Z)
- A Local Similarity-Preserving Framework for Nonlinear Dimensionality Reduction with Neural Networks [56.068488417457935]
We propose a novel local nonlinear approach named Vec2vec for general purpose dimensionality reduction.
To train the neural network, we build the neighborhood similarity graph of a matrix and define the context of data points.
Experiments on data classification and clustering across eight real datasets show that Vec2vec outperforms several classical dimensionality reduction methods under statistical hypothesis tests.
arXiv Detail & Related papers (2021-03-10T23:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.