Mining Functionally Related Genes with Semi-Supervised Learning
- URL: http://arxiv.org/abs/2011.03089v1
- Date: Thu, 5 Nov 2020 20:34:09 GMT
- Title: Mining Functionally Related Genes with Semi-Supervised Learning
- Authors: Kaiyu Shen, Razvan Bunescu and Sarah E. Wyatt
- Abstract summary: We introduce a rich set of features and use them in conjunction with semi-supervised learning approaches.
The framework of learning with positive and unlabeled examples (LPU) is shown to be especially appropriate for mining functionally related genes.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The study of biological processes can greatly benefit from tools that
automatically predict gene functions or directly cluster genes based on shared
functionality. Existing data mining methods predict protein functionality by
exploiting data obtained from high-throughput experiments or meta-scale
information from public databases. Most existing prediction tools are targeted
at predicting protein functions that are described in the gene ontology (GO).
However, in many cases biologists wish to discover functionally related genes
for which GO terms are inadequate. In this paper, we introduce a rich set of
features and use them in conjunction with semi-supervised learning approaches in
order to expand an initial set of seed genes to a larger cluster of
functionally related genes. Among all the semi-supervised methods that were
evaluated, the framework of learning with positive and unlabeled examples (LPU)
is shown to be especially appropriate for mining functionally related genes.
When evaluated on experimentally validated benchmark data, the LPU approaches
significantly outperform a standard supervised learning algorithm as well as an
established state-of-the-art method. Given an initial set of seed genes, our
best performing approach could be used to mine functionally related genes in a
wide range of organisms.
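The seed-expansion idea in the abstract can be illustrated with a minimal two-step sketch of learning from positive and unlabeled examples. This is not the authors' method: the centroid-based scoring, the function `rank_unlabeled`, the `negative_fraction` parameter, and the toy two-dimensional feature vectors are all assumptions made for illustration. The paper's actual features and classifiers are not given in this summary.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_unlabeled(seeds, unlabeled, negative_fraction=0.5):
    """Return unlabeled gene names ordered from most to least seed-like.

    seeds and unlabeled map gene names to feature vectors (hypothetical data).
    """
    pos_center = centroid(list(seeds.values()))

    # Step 1: treat the unlabeled genes farthest from the seed centroid
    # as "reliable negatives" (an assumption of this sketch).
    by_distance = sorted(
        unlabeled,
        key=lambda g: distance(unlabeled[g], pos_center),
        reverse=True,
    )
    n_neg = max(1, int(len(by_distance) * negative_fraction))
    neg_center = centroid([unlabeled[g] for g in by_distance[:n_neg]])

    # Step 2: score each unlabeled gene by how much closer it lies to the
    # seeds than to the reliable negatives; higher means more seed-like.
    def score(g):
        vec = unlabeled[g]
        return distance(vec, neg_center) - distance(vec, pos_center)

    return sorted(unlabeled, key=score, reverse=True)


# Toy data: two seed genes and four unlabeled genes (made-up feature values).
seeds = {"seedA": [1.0, 1.0], "seedB": [1.2, 0.8]}
unlabeled = {
    "gene1": [1.1, 0.9],
    "gene2": [5.0, 5.0],
    "gene3": [4.8, 5.2],
    "gene4": [0.9, 1.1],
}
ranking = rank_unlabeled(seeds, unlabeled)
print(ranking)
```

In this toy setting the genes near the seeds are ranked ahead of the distant ones. A real pipeline would replace the centroid scoring with a trained classifier and would need a principled way to choose the reliable negatives.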
Related papers
- BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments [112.25067497985447]
We introduce BioDiscoveryAgent, an agent that designs new experiments, reasons about their outcomes, and efficiently navigates the hypothesis space to reach desired solutions.
BioDiscoveryAgent can uniquely design new experiments without the need to train a machine learning model.
It achieves an average of 21% improvement in predicting relevant genetic perturbations across six datasets.
arXiv Detail & Related papers (2024-05-27T19:57:17Z)
- Single-Cell Deep Clustering Method Assisted by Exogenous Gene Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells.
During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation.
This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z)
- Gene Set Summarization using Large Language Models [1.312659265502151]
We develop a method that uses GPT models to perform gene set function summarization.
We demonstrate that these methods are able to generate plausible and biologically valid summary GO term lists for gene sets.
However, GPT-based approaches are unable to deliver reliable scores or p-values and often return terms that are not statistically significant.
arXiv Detail & Related papers (2023-05-21T02:06:33Z)
- Machine Learning Methods for Cancer Classification Using Gene Expression Data: A Review [77.34726150561087]
Cancer is the second major cause of death after cardiovascular diseases.
Gene expression can play a fundamental role in the early detection of cancer.
This study reviews recent progress in gene expression analysis for cancer classification using machine learning methods.
arXiv Detail & Related papers (2023-01-28T15:03:03Z)
- Natural language processing for clusterization of genes according to their functions [62.997667081978825]
We propose an approach that reduces the analysis of several thousand genes to analysis of several clusters.
The descriptions are encoded as vectors using the pretrained language model (BERT) and some text processing approaches.
arXiv Detail & Related papers (2022-07-17T12:59:34Z)
- Hierarchy exploitation to detect missing annotations on hierarchical multi-label classification [0.1749935196721634]
We present a method to detect missing annotations in hierarchical multi-label classification datasets.
We propose a method that exploits the class hierarchy by computing aggregated probabilities to the paths of classes from the leaves to the root for each instance.
The experiments on Oryza sativa Japonica, a variety of rice, showcase that incorporating the hierarchy of classes into the method often improves the predictive performance.
arXiv Detail & Related papers (2022-07-13T14:32:50Z)
- Gene Function Prediction with Gene Interaction Networks: A Context Graph Kernel Approach [24.234645183601998]
We propose to use a gene's context graph, i.e., the gene interaction network associated with the focal gene, to infer its functions.
In a kernel-based machine-learning framework, we design a context graph kernel to capture the information in context graphs.
arXiv Detail & Related papers (2022-04-22T02:54:01Z)
- Feature extraction using Spectral Clustering for Gene Function Prediction [0.4492444446637856]
This paper presents a novel in silico approach to the annotation problem that combines cluster analysis and hierarchical multi-label classification.
The proposed approach is applied to a case study on Zea mays, one of the most dominant and productive crops in the world.
arXiv Detail & Related papers (2022-03-25T10:17:36Z)
- Handling highly correlated genes in prediction analysis of genomic studies [0.0]
High correlation among genes introduces technical problems, such as multi-collinearity issues, leading to unreliable prediction models.
We propose a grouping algorithm, which treats highly correlated genes as a group and uses their common pattern to represent the group's biological signal in feature selection.
Our proposed grouping method has two advantages. First, using the gene group's common patterns makes the prediction more robust and reliable under condition change.
arXiv Detail & Related papers (2020-07-05T22:14:03Z)
- A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
- A Novel Granular-Based Bi-Clustering Method of Deep Mining the Co-Expressed Genes [76.84066556597342]
Bi-clustering methods are used to mine bi-clusters whose subsets of samples (genes) are co-regulated under their test conditions.
Unfortunately, traditional bi-clustering methods are not fully effective in discovering such bi-clusters.
We propose a novel bi-clustering method by involving here the theory of Granular Computing.
arXiv Detail & Related papers (2020-05-12T02:04:40Z)