Prediction by Machine Learning Analysis of Genomic Data Phenotypic Frost Tolerance in Perccottus glenii
- URL: http://arxiv.org/abs/2410.08867v1
- Date: Fri, 11 Oct 2024 14:45:47 GMT
- Title: Prediction by Machine Learning Analysis of Genomic Data Phenotypic Frost Tolerance in Perccottus glenii
- Authors: Lilin Fan, Xuqing Chai, Zhixiong Tian, Yihang Qiao, Zhen Wang, Yifan Zhang,
- Abstract summary: We will employ machine learning techniques to analyze the gene sequences of Perccottus glenii.
We constructed four classification models: Random Forest, LightGBM, XGBoost, and Decision Tree.
The dataset used by these classification models was extracted from the National Center for Biotechnology Information database.
- Score: 7.412214379486083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Analysis of the genome sequence of Perccottus glenii, the only fish known to possess freeze tolerance, holds significant importance for understanding how organisms adapt to extreme environments, Traditional biological analysis methods are time-consuming and have limited accuracy, To address these issues, we will employ machine learning techniques to analyze the gene sequences of Perccottus glenii, with Neodontobutis hainanens as a comparative group, Firstly, we have proposed five gene sequence vectorization methods and a method for handling ultra-long gene sequences, We conducted a comparative study on the three vectorization methods: ordinal encoding, One-Hot encoding, and K-mer encoding, to identify the optimal encoding method, Secondly, we constructed four classification models: Random Forest, LightGBM, XGBoost, and Decision Tree, The dataset used by these classification models was extracted from the National Center for Biotechnology Information database, and we vectorized the sequence matrices using the optimal encoding method, K-mer, The Random Forest model, which is the optimal model, achieved a classification accuracy of up to 99, 98 , Lastly, we utilized SHAP values to conduct an interpretable analysis of the optimal classification model, Through ten-fold cross-validation and the AUC metric, we identified the top 10 features that contribute the most to the model's classification accuracy, This demonstrates that machine learning methods can effectively replace traditional manual analysis in identifying genes associated with the freeze tolerance phenotype in Perccottus glenii.
Related papers
- Enhanced Gene Selection in Single-Cell Genomics: Pre-Filtering Synergy and Reinforced Optimization [16.491060073775884]
We introduce an iterative gene panel selection strategy applicable to clustering tasks in single-cell genomics.
Our method integrates results from other gene selection algorithms, providing valuable preliminary boundaries.
We incorporate the nature of the exploration process in reinforcement learning (RL) and its capability for continuous optimization.
arXiv Detail & Related papers (2024-06-11T16:21:33Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - DNA Sequence Classification with Compressors [0.0]
Our study introduces a novel adaptation of Jiang et al.'s compressor-based, parameter-free classification method, specifically tailored for DNA sequence analysis.
Not only does this method align with the current state-of-the-art in terms of accuracy, but it also offers a more resource-efficient alternative to traditional machine learning methods.
arXiv Detail & Related papers (2024-01-25T09:17:19Z) - Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z) - Deep metric learning improves lab of origin prediction of genetically
engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations.
We propose a method, based on metric learning, that ranks the most likely labs-of-origin.
We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z) - Multi-modal Self-supervised Pre-training for Regulatory Genome Across
Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z) - A multi-stage machine learning model on diagnosis of esophageal
manometry [50.591267188664666]
The framework includes deep-learning models at the swallow-level stage and feature-based machine learning models at the study-level stage.
This is the first artificial-intelligence-style model to automatically predict CC diagnosis of HRM study from raw multi-swallow data.
arXiv Detail & Related papers (2021-06-25T20:09:23Z) - Data-Driven Logistic Regression Ensembles With Applications in Genomics [0.0]
We propose a new approach for dealing with high-dimensional binary classification problems that combines ideas from regularization and ensembling.
We demonstrate the good performance of our method in terms of prediction accuracy and identification of key biomarkers using several medical datasets involving common diseases such as cancer, multiple sclerosis and psoriasis.
arXiv Detail & Related papers (2021-02-17T05:57:26Z) - G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for
Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers.
We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z) - Mycorrhiza: Genotype Assignment usingPhylogenetic Networks [2.286041284499166]
We introduce Mycorrhiza, a machine learning approach for the genotype assignment problem.
Our algorithm makes use of phylogenetic networks to engineer features that encode the evolutionary relationships among samples.
Mycorrhiza yields particularly significant gains on datasets with a large average fixation index (FST) or deviation from the Hardy-Weinberg equilibrium.
arXiv Detail & Related papers (2020-10-14T02:36:27Z) - Low-Rank Reorganization via Proportional Hazards Non-negative Matrix
Factorization Unveils Survival Associated Gene Clusters [9.773075235189525]
In this work, Cox proportional hazards regression is integrated with NMF by imposing survival constraints.
Using human cancer gene expression data, the proposed technique can unravel critical clusters of cancer genes.
The discovered gene clusters reflect rich biological implications and can help identify survival-related biomarkers.
arXiv Detail & Related papers (2020-08-09T17:59:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.