Redundancy-aware unsupervised rankings for collections of gene sets
- URL: http://arxiv.org/abs/2307.16182v1
- Date: Sun, 30 Jul 2023 09:39:42 GMT
- Title: Redundancy-aware unsupervised rankings for collections of gene sets
- Authors: Chiara Balestra, Carlo Maj, Emmanuel Müller, Andreas Mayr
- Abstract summary: We propose to use importance scores to rank the pathways in the collections studying the context from a set covering perspective.
The proposed method shows practical utility in bioinformatics for increasing the interpretability of collections of gene sets.
- Score: 0.28675177318965034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The biological roles of gene sets are used to group them into collections.
These collections are often characterized by being high-dimensional,
overlapping, and redundant families of sets, thus precluding a straightforward
interpretation and study of their content. Bioinformatics research has looked for
solutions to reduce their dimension or increase their interpretability. One possibility
lies in aggregating overlapping gene sets to create larger pathways, but the
modified biological pathways are hardly biologically justifiable. We propose to
use importance scores to rank the pathways in the collections studying the
context from a set covering perspective. The proposed Shapley values-based
scores consider the distribution of the singletons and the size of the sets in
the families; furthermore, a trick allows us to circumvent the usual
exponential complexity of Shapley values' computation. Finally, we address the
challenge of including a redundancy awareness in the obtained rankings where,
in our case, sets are redundant if they show prominent intersections.
The rankings can be used to reduce the dimension of collections of gene sets,
such that they show lower redundancy and still a high coverage of the genes. We
further investigate the impact of our selection on Gene Set Enrichment
Analysis. The proposed method shows practical utility in bioinformatics for
increasing the interpretability of collections of gene sets and is a step
toward including redundancy in Shapley value computations.
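As an illustration of how a Shapley-style score can reflect the distribution of singletons without exponential cost, the minimal Python sketch below uses a plain coverage game, v(S) = |union of the sets in S|. This value function is an assumption made here for illustration and is not necessarily the one used by the authors. For the coverage game, each gene's unit of value is split equally among the sets containing it, so the Shapley value of a set is the sum of 1/count(g) over its genes. The greedy Jaccard-based redundancy penalty, the toy family, and all function names are likewise hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def coverage_shapley(family):
    """Closed-form Shapley values for the coverage game v(S) = |union of S|.

    Assumption: this game is chosen for illustration only.
    Each gene contributes 1 / (number of sets containing it) to every
    set that contains it, so no enumeration of coalitions is needed.
    """
    counts = Counter(g for genes in family.values() for g in genes)
    return {
        name: sum(Fraction(1, counts[g]) for g in genes)
        for name, genes in family.items()
    }


def brute_force_shapley(family):
    """Reference implementation: average marginal coverage over all orderings."""
    names = list(family)
    totals = {n: Fraction(0) for n in names}
    orders = list(permutations(names))
    for order in orders:
        covered = set()
        for n in order:
            totals[n] += len(family[n] - covered)  # genes newly covered by n
            covered |= family[n]
    return {n: totals[n] / len(orders) for n in names}


def jaccard(a, b):
    """Jaccard overlap between two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def redundancy_aware_ranking(family, scores):
    """Hypothetical greedy ranking: a set's score is scaled down by its
    largest Jaccard overlap with any set that has already been ranked."""
    ranking = []
    remaining = sorted(family)  # sorted for deterministic tie-breaking
    while remaining:
        def penalized(name):
            overlap = max(
                (jaccard(family[name], family[r]) for r in ranking),
                default=0.0,
            )
            return float(scores[name]) * (1.0 - overlap)

        best = max(remaining, key=penalized)
        ranking.append(best)
        remaining.remove(best)
    return ranking


# Toy family: B is almost entirely contained in A, C is disjoint from both.
family = {
    "A": {"g1", "g2", "g3", "g4", "g5"},
    "B": {"g1", "g2", "g3", "g4"},
    "C": {"g6"},
}
scores = coverage_shapley(family)
ranking = redundancy_aware_ranking(family, scores)
print(scores)
print(ranking)
```

On this toy family the plain scores place B above C, while the penalized ranking moves the disjoint set C ahead of the largely redundant set B. The brute-force function is included only as a check that the closed form agrees with the permutation definition on small inputs.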
Related papers
- You Only Train Once: Differentiable Subset Selection for Omics Data [16.72884554628602]
YOTO is an end-to-end framework that jointly identifies discrete gene subsets and performs prediction within a single differentiable architecture. We evaluate YOTO on two representative single-cell RNA-seq datasets, showing that it consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-12-19T15:17:34Z) - Learning ON Large Datasets Using Bit-String Trees [0.0]
This thesis develops computational methods in similarity-preserving hashing, classification, and cancer genomics. We introduce Compressed BST of Inverted hash tables (ComBI), which enables fast approximate nearest-neighbor search with reduced memory. We show that GRAF and ComBI can be used to estimate per-sample classifiability, which enables scalable prediction of cancer patient survival.
arXiv Detail & Related papers (2025-08-23T16:49:42Z) - GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype [51.58774936662233]
Building gene regulatory networks (GRN) is essential to understand and predict the effects of genetic perturbations. In this work, we leverage a pre-trained large language model and a DNA sequence model to extract features from gene descriptions and DNA sequence data. We introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes.
arXiv Detail & Related papers (2025-05-06T03:35:24Z) - BOLIMES: Boruta and LIME optiMized fEature Selection for Gene Expression Classification [0.0937465283958018]
BOLIMES is a novel feature selection algorithm designed to enhance gene expression classification.
It combines exhaustive feature selection with interpretability-driven refinement, offering a powerful solution for high-dimensional gene expression analysis.
arXiv Detail & Related papers (2025-02-18T17:33:41Z) - Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z) - Enhancing Neural Subset Selection: Integrating Background Information into Set Representations [53.15923939406772]
We show that when the target value is conditioned on both the input set and subset, it is essential to incorporate an invariant sufficient statistic of the superset into the subset of interest.
This ensures that the output value remains invariant to permutations of the subset and its corresponding superset, enabling identification of the specific superset from which the subset originated.
arXiv Detail & Related papers (2024-02-05T16:09:35Z) - Single-Cell Deep Clustering Method Assisted by Exogenous Gene Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells.
During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation.
This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z) - StyleGenes: Discrete and Efficient Latent Distributions for GANs [149.0290830305808]
We propose a discrete latent distribution for Generative Adversarial Networks (GANs).
Instead of drawing latent vectors from a continuous prior, we sample from a finite set of learnable latents.
We take inspiration from the encoding of information in biological organisms.
arXiv Detail & Related papers (2023-04-30T23:28:46Z) - Redundancy-aware unsupervised ranking based on game theory -- application to gene enrichment analysis [0.28675177318965034]
We propose a method to rank sets within a family of sets based on the distribution of the singletons and their size.
We evaluate our approach for gene sets collections; the rankings obtained show low redundancy and high coverage of the genes.
arXiv Detail & Related papers (2022-07-22T08:57:08Z) - Unsupervised Features Ranking via Coalitional Game Theory for Categorical Data [0.28675177318965034]
Unsupervised feature selection aims to reduce the number of features.
We show that the resulting feature selection outperforms competing methods in lowering the redundancy rate.
arXiv Detail & Related papers (2022-05-17T14:17:36Z) - Towards Group Robustness in the presence of Partial Group Labels [61.33713547766866]
Spurious correlations between input samples and the target labels wrongly direct the neural network predictions.
We propose an algorithm that optimizes for the worst-off group assignments from a constraint set.
We show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
arXiv Detail & Related papers (2022-01-10T22:04:48Z) - Object-Attribute Biclustering for Elimination of Missing Genotypes in Ischemic Stroke Genome-Wide Data [2.0236506875465863]
Missing genotypes can affect the efficacy of machine learning approaches to identify the risk genetic variants of common diseases and traits.
The problem occurs when genotypic data are collected from different experiments with different DNA microarrays, each being characterised by its pattern of uncalled (missing) genotypes.
We use well-developed notions of object-attribute biclusters and formal concepts that correspond to dense subrelations in the binary relation.
arXiv Detail & Related papers (2020-10-22T12:27:43Z) - A Novel Granular-Based Bi-Clustering Method of Deep Mining the Co-Expressed Genes [76.84066556597342]
Bi-clustering methods are used to mine bi-clusters whose subsets of samples (genes) are co-regulated under their test conditions.
Unfortunately, traditional bi-clustering methods are not fully effective in discovering such bi-clusters.
We propose a novel bi-clustering method that draws on the theory of Granular Computing.
arXiv Detail & Related papers (2020-05-12T02:04:40Z) - Learn to Predict Sets Using Feed-Forward Neural Networks [63.91494644881925]
This paper addresses the task of set prediction using deep feed-forward neural networks.
We present a novel approach for learning to predict sets with unknown permutation and cardinality.
We demonstrate the validity of our set formulations on relevant vision problems.
arXiv Detail & Related papers (2020-01-30T01:52:07Z) - Guiding Corpus-based Set Expansion by Auxiliary Sets Generation and Co-Expansion [45.716171458483636]
Corpus-based set expansion algorithms bootstrap the given seeds by incorporating lexical patterns and distributional similarity.
Set-CoExpan automatically generates auxiliary sets as negative sets that are closely related to the target set of the user's interest.
We show that Set-CoExpan outperforms strong baseline methods significantly.
arXiv Detail & Related papers (2020-01-27T22:34:07Z)