Can sparse autoencoders make sense of latent representations?
- URL: http://arxiv.org/abs/2410.11468v2
- Date: Mon, 03 Feb 2025 18:20:35 GMT
- Title: Can sparse autoencoders make sense of latent representations?
- Authors: Viktoria Schuster
- Abstract summary: We find that latent representations can encode observable and directly connected upstream hidden variables in superposition.
Applied to single-cell multi-omics data, SAEs can uncover key biological processes.
- Abstract: Sparse autoencoders (SAEs) have lately been used to uncover interpretable latent features in large language models. Here, we explore their potential for decomposing latent representations in complex and high-dimensional biological data, where the underlying variables are often unknown. Using simulated data, we find that latent representations can encode observable and directly connected upstream hidden variables in superposition. The degree to which they are learned depends on the type of variable and the model architecture, favoring shallow and wide networks. Superpositions, however, are not identifiable if the generative variables are unknown. SAEs can recover these variables and their structure with respect to the observables. Applied to single-cell multi-omics data, we show that SAEs can uncover key biological processes. We further present an automated method for linking SAE features to biological concepts to enable large-scale analysis of single-cell expression models.
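A minimal sketch may help make the core technique concrete. The PyTorch snippet below trains an overcomplete sparse autoencoder on pre-computed latent embeddings; the shapes, hyperparameters, and L1 penalty weight are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sparse autoencoder sketch (assumed shapes and hyperparameters).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_latent: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_latent, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_latent)

    def forward(self, x):
        # ReLU keeps feature activations non-negative; the L1 penalty below
        # drives most of them to zero, encouraging sparse, monosemantic features.
        f = torch.relu(self.encoder(x))
        return self.decoder(f), f

latents = torch.randn(1024, 64)     # stand-in for embeddings from a trained model
sae = SparseAutoencoder(d_latent=64, d_hidden=512)  # overcomplete dictionary
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coef = 1e-3                      # sparsity strength (assumed value)

for _ in range(100):
    recon, feats = sae(latents)
    loss = ((recon - latents) ** 2).mean() + l1_coef * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, each hidden unit of the SAE can be inspected as a candidate feature and, as the abstract describes, linked to a biological concept.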
Related papers
- PolSAM: Polarimetric Scattering Mechanism Informed Segment Anything Model [76.95536611263356]
PolSAR data presents unique challenges due to its rich and complex characteristics.
Existing data representations, such as complex-valued data, polarimetric features, and amplitude images, are widely used.
Most feature extraction networks for PolSAR are small, limiting their ability to capture features effectively.
We propose the Polarimetric Scattering Mechanism-Informed SAM (PolSAM), an enhanced Segment Anything Model (SAM) that integrates domain-specific scattering characteristics and a novel prompt generation strategy.
arXiv Detail & Related papers (2024-12-17T09:59:53Z) - States Hidden in Hidden States: LLMs Emerge Discrete State Representations Implicitly [72.24742240125369]
In this paper, we uncover LLMs' intrinsic ability to perform extended sequences of calculations without relying on chain-of-thought, step-by-step solutions.
Remarkably, the most advanced models can directly output the result of adding up to 15 two-digit numbers.
arXiv Detail & Related papers (2024-07-16T06:27:22Z) - Posterior Collapse and Latent Variable Non-identifiability [54.842098835445]
We propose a class of latent-identifiable variational autoencoders, deep generative models which enforce identifiability without sacrificing flexibility.
Across synthetic and real datasets, latent-identifiable variational autoencoders outperform existing methods in mitigating posterior collapse and providing meaningful representations of the data.
arXiv Detail & Related papers (2023-01-02T06:16:56Z) - Learning Causal Representations of Single Cells via Sparse Mechanism
Shift Modeling [3.2435888122704037]
We propose a deep generative model of single-cell gene expression data for which each perturbation is treated as an intervention targeting an unknown, but sparse, subset of latent variables.
We benchmark these methods on simulated single-cell data to evaluate their performance at latent units recovery, causal target identification and out-of-domain generalization.
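As a loose illustration of the sparse mechanism shift idea (not the authors' implementation), the sketch below gives each perturbation a learnable latent shift whose sparsity is encouraged with an L1 penalty; all names and shapes are assumptions.

```python
# Sketch: each perturbation intervenes on a sparse subset of latent variables.
import torch
import torch.nn as nn

n_perturbations, d_latent = 10, 16
shifts = nn.Parameter(torch.zeros(n_perturbations, d_latent))  # rows should be sparse

def perturbed_latent(z_base: torch.Tensor, p: torch.Tensor) -> torch.Tensor:
    # p holds the perturbation index for each cell; the intervention adds
    # that perturbation's (hopefully sparse) shift to the base latent state.
    return z_base + shifts[p]

z = torch.randn(32, d_latent)
p = torch.randint(0, n_perturbations, (32,))
z_int = perturbed_latent(z, p)
sparsity_penalty = shifts.abs().sum()  # would be added to the model's training loss
```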
arXiv Detail & Related papers (2022-11-07T15:47:40Z) - Equivariance Allows Handling Multiple Nuisance Variables When Analyzing
Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show that combining recent results on equivariant representation learning over structured spaces with simple classical results from causal inference yields an effective, practical solution.
We demonstrate that, under some assumptions, our model can handle more than one nuisance variable and enables analysis of pooled scientific datasets in scenarios that would otherwise require discarding a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z) - SubOmiEmbed: Self-supervised Representation Learning of Multi-omics Data
for Cancer Type Classification [4.992154875028543]
Integration and analysis of multi-omics data give us a broad view of tumours, which can improve clinical decision making.
SubOmiEmbed produces comparable results to the baseline OmiEmbed with a much smaller network and by using just a subset of the data.
This work could be extended to also integrate mutation-based genomic data.
arXiv Detail & Related papers (2022-02-03T16:39:09Z) - MURAL: An Unsupervised Random Forest-Based Embedding for Electronic
Health Record Data [59.26381272149325]
We present an unsupervised random forest for representing data with disparate variable types.
MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random.
We show that our approach visualizes and classifies data more accurately than competing methods.
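The random-split construction can be sketched in a few lines. The toy NumPy code below builds unsupervised trees whose split variable at each node is drawn at random; the median threshold and all parameters are assumptions, and the real method additionally handles mixed variable types and missingness.

```python
# Toy sketch of unsupervised trees with randomly chosen split variables.
import numpy as np

rng = np.random.default_rng(0)

def build_tree(X: np.ndarray, depth: int = 0, max_depth: int = 4):
    if depth == max_depth or len(X) < 2:
        return {"leaf": True, "size": len(X)}
    j = rng.integers(X.shape[1])       # split variable chosen at random
    t = np.median(X[:, j])             # unsupervised threshold (assumed: median)
    left, right = X[X[:, j] <= t], X[X[:, j] > t]
    if len(left) == 0 or len(right) == 0:
        return {"leaf": True, "size": len(X)}
    return {"leaf": False, "var": j, "thr": t,
            "left": build_tree(left, depth + 1, max_depth),
            "right": build_tree(right, depth + 1, max_depth)}

forest = [build_tree(rng.standard_normal((200, 8))) for _ in range(10)]
```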
arXiv Detail & Related papers (2021-11-19T22:02:21Z) - Identifiable Variational Autoencoders via Sparse Decoding [37.30831737046145]
We develop the Sparse VAE, a deep generative model for unsupervised representation learning on high-dimensional data.
We first show that the Sparse VAE is identifiable: given data drawn from the model, there exists a uniquely optimal set of factors.
We empirically study the Sparse VAE with both simulated and real data.
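As a rough illustration of sparse decoding, the snippet below penalizes the L1 norm of a linear decoder's loading matrix so that each observed dimension depends on only a few factors; the actual Sparse VAE uses a spike-and-slab prior instead, so treat this as a stand-in sketch with assumed shapes.

```python
# Sketch: a linear decoder whose per-feature loadings are pushed toward sparsity.
import torch
import torch.nn as nn

d_latent, d_obs = 8, 100
W = nn.Parameter(torch.randn(d_obs, d_latent) * 0.01)  # decoder loadings
b = nn.Parameter(torch.zeros(d_obs))

def decode(z: torch.Tensor) -> torch.Tensor:
    return z @ W.T + b

z = torch.randn(16, d_latent)
x_hat = decode(z)
l1_on_loadings = W.abs().sum()  # added to the ELBO to induce sparse rows of W
```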
arXiv Detail & Related papers (2021-10-20T22:11:33Z) - Encoding Domain Information with Sparse Priors for Inferring Explainable
Latent Variables [2.8935588665357077]
We propose spex-LVM, a factorial latent variable model with sparse priors to encourage the inference of explainable factors.
spex-LVM utilizes existing knowledge of curated biomedical pathways to automatically assign annotated attributes to latent factors.
Evaluations on simulated and real single-cell RNA-seq datasets demonstrate that our model robustly identifies relevant structure in an inherently explainable manner.
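One way to picture pathway-informed sparse priors is a binary membership mask on the factor-to-gene loadings, as in the hypothetical sketch below; the random mask stands in for curated pathway annotations, and everything here is an assumption rather than spex-LVM's actual model.

```python
# Sketch: tie each latent factor to an annotated gene set via a binary mask.
import torch
import torch.nn as nn

n_factors, n_genes = 5, 200
pathway_mask = (torch.rand(n_factors, n_genes) < 0.1).float()  # curated pathways in practice
loadings = nn.Parameter(torch.randn(n_factors, n_genes) * 0.01)

def factor_to_expression(z: torch.Tensor) -> torch.Tensor:
    # Masking restricts each factor to its pathway's genes, making it explainable.
    return z @ (loadings * pathway_mask)

z = torch.randn(16, n_factors)
x_mean = factor_to_expression(z)
```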
arXiv Detail & Related papers (2021-07-08T10:19:32Z) - Category-Learning with Context-Augmented Autoencoder [63.05016513788047]
Finding an interpretable non-redundant representation of real-world data is one of the key problems in Machine Learning.
We propose a novel method of using data augmentations when training autoencoders.
We train a Variational Autoencoder such that the outcome of a transformation is predictable by an auxiliary network.
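A rough sketch of that training signal, with all shapes, the stand-in augmentation, and the auxiliary loss assumed for illustration:

```python
# Sketch: an auxiliary network must predict the latent code of an augmented
# input from the original code plus the augmentation parameters, pushing the
# autoencoder toward representations that transform predictably.
import torch
import torch.nn as nn

d_in, d_latent, d_aug = 32, 8, 2
enc = nn.Linear(d_in, d_latent)
dec = nn.Linear(d_latent, d_in)
aux = nn.Sequential(nn.Linear(d_latent + d_aug, 32), nn.ReLU(),
                    nn.Linear(32, d_latent))  # predicts latent of augmented input

x = torch.randn(64, d_in)
aug_params = torch.randn(64, d_aug)               # e.g. shift/rotation parameters
x_aug = x + aug_params.mean(dim=1, keepdim=True)  # stand-in augmentation

z, z_aug = enc(x), enc(x_aug)
recon_loss = ((dec(z) - x) ** 2).mean()
pred_loss = ((aux(torch.cat([z, aug_params], dim=1)) - z_aug.detach()) ** 2).mean()
loss = recon_loss + pred_loss
```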
arXiv Detail & Related papers (2020-10-10T14:04:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.