Optirank: classification for RNA-Seq data with optimal ranking reference
genes
- URL: http://arxiv.org/abs/2301.04653v1
- Date: Wed, 11 Jan 2023 10:49:06 GMT
- Title: Optirank: classification for RNA-Seq data with optimal ranking reference
genes
- Authors: Paola Malsot (1), Filipe Martins (1), Didier Trono (1), Guillaume
Obozinski (1, 2 and 3) ((1) Ecole Polytechnique Fédérale de Lausanne, (2)
Swiss Data Science Center, (3) ETH Zürich)
- Abstract summary: We propose a logistic regression model, optirank, which learns simultaneously the parameters of the model and the genes to use as a reference set in the ranking.
We also consider real classification tasks, which present different kinds of distribution shifts between train and test data.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Classification algorithms using RNA-Sequencing (RNA-Seq) data as input are
used in a variety of biological applications. By nature, RNA-Seq data is
subject to uncontrolled fluctuations both within and especially across
datasets, which presents a major difficulty for a trained classifier to
generalize to an external dataset. Replacing raw gene counts with the rank of
gene counts inside an observation has proven effective to mitigate this
problem. However, the rank of a feature is by definition relative to all other
features, including highly variable features that introduce noise in the
ranking. To address this problem and obtain more robust ranks, we propose a
logistic regression model, optirank, which learns simultaneously the parameters
of the model and the genes to use as a reference set in the ranking. We show
the effectiveness of this method on simulated data. We also consider real
classification tasks, which present different kinds of distribution shifts
between train and test data. Those tasks concern a variety of applications,
such as cancer of unknown primary classification, identification of specific
gene signatures, and determination of cell type in single-cell RNA-Seq
datasets. On those real tasks, optirank performs at least as well as the
vanilla logistic regression on classical ranks, while producing sparser
solutions. In addition, to increase the robustness against dataset shifts, we
propose a multi-source learning scheme and demonstrate its effectiveness when
used in combination with rank-based classifiers.
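
The idea of ranking each gene against a reference set rather than against all genes can be illustrated with a minimal sketch. It assumes the rank of a gene is the number of reference genes with a strictly smaller count in the same observation; the paper's exact definition may differ. The function name `rank_within_reference`, the toy counts, and the hand-picked reference mask are hypothetical. In particular, optirank learns the reference set jointly with the classifier, whereas here it is fixed by hand.

```python
import numpy as np


def rank_within_reference(counts, reference_mask):
    """Rank every gene against a chosen reference set, per observation.

    counts: (n_samples, n_genes) array of expression values.
    reference_mask: boolean array of length n_genes marking reference genes.
    Returns an integer array of shape (n_samples, n_genes) where entry (i, j)
    is the number of reference genes whose count in sample i is strictly
    smaller than the count of gene j (assumed definition, for illustration).
    """
    counts = np.asarray(counts, dtype=float)
    reference_mask = np.asarray(reference_mask, dtype=bool)
    ref = counts[:, reference_mask]                  # (n_samples, n_ref)
    # Broadcast: compare each gene with each reference gene in the same sample.
    greater = counts[:, :, None] > ref[:, None, :]   # (n_samples, n_genes, n_ref)
    return greater.sum(axis=2)


# Toy data (hypothetical): sample 1 is sample 0 scaled by 10 on genes 0-2,
# while gene 3 fluctuates wildly between the two samples.
counts = np.array([[5.0, 1.0, 3.0, 100.0],
                   [50.0, 10.0, 30.0, 2.0]])

all_genes = np.ones(4, dtype=bool)                    # classical ranking
stable_only = np.array([True, True, True, False])     # hand-picked reference set

classical = rank_within_reference(counts, all_genes)
restricted = rank_within_reference(counts, stable_only)
```

With all genes as reference, the ranks of genes 0 to 2 change between the two samples because the noisy gene 3 moves from the top to the bottom of the ordering. With the restricted reference set, those three genes receive ranks 2, 0 and 1 in both samples, which is the kind of robustness the abstract describes.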
Related papers
- Symmetry Discovery for Different Data Types [52.2614860099811]
Equivariant neural networks incorporate symmetries into their architecture, achieving higher generalization performance.
We propose LieSD, a method for discovering symmetries via trained neural networks which approximate the input-output mappings of the tasks.
We validate the performance of LieSD on tasks with symmetries such as the two-body problem, the moment of inertia matrix prediction, and top quark tagging.
(2024-10-13T13:39:39Z)
- Exploring Beyond Logits: Hierarchical Dynamic Labeling Based on Embeddings for Semi-Supervised Classification [49.09505771145326]
We propose a Hierarchical Dynamic Labeling (HDL) algorithm that does not depend on model predictions and utilizes image embeddings to generate sample labels.
Our approach has the potential to change the paradigm of pseudo-label generation in semi-supervised learning.
(2024-04-26T06:00:27Z)
- Feature Selection via Robust Weighted Score for High Dimensional Binary Class-Imbalanced Gene Expression Data [1.2891210250935148]
A robust weighted score for unbalanced data (ROWSU) is proposed for selecting the most discriminative feature for high dimensional gene expression binary classification with class-imbalance problem.
The performance of the proposed ROWSU method is evaluated on 6 gene expression datasets.
(2024-01-23T11:22:03Z)
- scHyena: Foundation Model for Full-Length Single-Cell RNA-Seq Analysis in Brain [46.39828178736219]
We introduce scHyena, a foundation model designed to address these challenges and enhance the accuracy of scRNA-seq analysis in the brain.
scHyena is equipped with a linear adaptor layer, the positional encoding via gene-embedding, and a bidirectional Hyena operator.
This enables us to process full-length scRNA-seq data without losing any information from the raw data.
(2023-10-04T10:30:08Z)
- Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics [62.997667081978825]
We address the challenge of using energy-based models to produce high-quality, label-specific data in structured datasets.
Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing.
We use a novel training algorithm that exploits non-equilibrium effects.
(2023-07-13T15:08:44Z)
- Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
(2023-03-06T14:49:59Z)
- Improving the quality of generative models through Smirnov transformation [1.3492000366723798]
We propose a novel activation function to be used as output of the generator agent.
It is based on the Smirnov probabilistic transformation and it is specifically designed to improve the quality of the generated data.
(2021-10-29T17:01:06Z)
- A systematic evaluation of methods for cell phenotype classification using single-cell RNA sequencing data [7.62849213621469]
This study evaluates 13 popular supervised machine learning algorithms to classify cell phenotypes.
The study outcomes showed that ElasticNet with interactions performed best in small and medium data sets.
(2021-10-01T23:24:15Z)
- Rank-R FNN: A Tensor-Based Learning Model for High-Order Data Classification [69.26747803963907]
Rank-R Feedforward Neural Network (FNN) is a tensor-based nonlinear learning model that imposes Canonical/Polyadic decomposition on its parameters.
First, it handles inputs as multilinear arrays, bypassing the need for vectorization, and can thus fully exploit the structural information along every data dimension.
We establish the universal approximation and learnability properties of Rank-R FNN, and we validate its performance on real-world hyperspectral datasets.
(2021-04-11T16:37:32Z)
- Approximate kNN Classification for Biomedical Data [1.1852406625172218]
Single-cell RNA-seq (scRNA-seq) is an emerging DNA sequencing technology with promising capabilities but significant computational challenges.
We propose the utilization of approximate nearest neighbor search algorithms for the task of kNN classification in scRNA-seq data.
(2020-12-03T18:30:43Z)