Related papers: Prediction by Machine Learning Analysis of Genomic Data Phenotypic Frost Tolerance in Perccottus glenii

Prediction by Machine Learning Analysis of Genomic Data Phenotypic Frost Tolerance in Perccottus glenii

URL: http://arxiv.org/abs/2410.08867v1
Date: Fri, 11 Oct 2024 14:45:47 GMT
Title: Prediction by Machine Learning Analysis of Genomic Data Phenotypic Frost Tolerance in Perccottus glenii
Authors: Lilin Fan, Xuqing Chai, Zhixiong Tian, Yihang Qiao, Zhen Wang, Yifan Zhang,
Abstract summary: We will employ machine learning techniques to analyze the gene sequences of Perccottus glenii. We constructed four classification models: Random Forest, LightGBM, XGBoost, and Decision Tree. The dataset used by these classification models was extracted from the National Center for Biotechnology Information database.
Score: 7.412214379486083
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Analysis of the genome sequence of Perccottus glenii, the only fish known to possess freeze tolerance, holds significant importance for understanding how organisms adapt to extreme environments, Traditional biological analysis methods are time-consuming and have limited accuracy, To address these issues, we will employ machine learning techniques to analyze the gene sequences of Perccottus glenii, with Neodontobutis hainanens as a comparative group, Firstly, we have proposed five gene sequence vectorization methods and a method for handling ultra-long gene sequences, We conducted a comparative study on the three vectorization methods: ordinal encoding, One-Hot encoding, and K-mer encoding, to identify the optimal encoding method, Secondly, we constructed four classification models: Random Forest, LightGBM, XGBoost, and Decision Tree, The dataset used by these classification models was extracted from the National Center for Biotechnology Information database, and we vectorized the sequence matrices using the optimal encoding method, K-mer, The Random Forest model, which is the optimal model, achieved a classification accuracy of up to 99, 98 , Lastly, we utilized SHAP values to conduct an interpretable analysis of the optimal classification model, Through ten-fold cross-validation and the AUC metric, we identified the top 10 features that contribute the most to the model's classification accuracy, This demonstrates that machine learning methods can effectively replace traditional manual analysis in identifying genes associated with the freeze tolerance phenotype in Perccottus glenii.

Related papers

Improving statistical learning methods via features selection without replacement sampling and random projection [0.680740878601496]
Cancer is a genetic disease characterized by genetic and epigenetic alterations that disrupt normal gene expression.<n>High-dimensional microarray datasets pose challenges for classification models due to the "small n, large p" problem.<n>This study contributes to cancer biomarker discovery, offering a robust computational method for analyzing microarray data.
arXiv Detail & Related papers (2025-05-28T22:36:46Z)
BOLIMES: Boruta and LIME optiMized fEature Selection for Gene Expression Classification [0.0937465283958018]
BOLIMES is a novel feature selection algorithm designed to enhance gene expression classification. It combines exhaustive feature selection with interpretability-driven refinement, offering a powerful solution for high-dimensional gene expression analysis.
arXiv Detail & Related papers (2025-02-18T17:33:41Z)
GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
A Robust Support Vector Machine Approach for Raman COVID-19 Data Classification [0.7864304771129751]
In this paper, we investigate the performance of a novel robust formulation for Support Vector Machine (SVM) in classifying COVID-19 samples obtained from Raman spectroscopy. We derive robust counterpart models of deterministic formulations using bounded-by-norm uncertainty sets around each observation. The effectiveness of our approach is validated on real-world COVID-19 datasets provided by Italian hospitals.
arXiv Detail & Related papers (2025-01-29T14:02:45Z)
Enhanced Gene Selection in Single-Cell Genomics: Pre-Filtering Synergy and Reinforced Optimization [16.491060073775884]
We introduce an iterative gene panel selection strategy applicable to clustering tasks in single-cell genomics. Our method integrates results from other gene selection algorithms, providing valuable preliminary boundaries. We incorporate the nature of the exploration process in reinforcement learning (RL) and its capability for continuous optimization.
arXiv Detail & Related papers (2024-06-11T16:21:33Z)
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present textscLingo: textscLanguage prefix ftextscIne-tuning for textscGentextscOmes. Unlike DNA foundation models, textscLingo strategically leverages natural language foundation models' contextual cues. textscLingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
DNA Sequence Classification with Compressors [0.0]
Our study introduces a novel adaptation of Jiang et al.'s compressor-based, parameter-free classification method, specifically tailored for DNA sequence analysis. Not only does this method align with the current state-of-the-art in terms of accuracy, but it also offers a more resource-efficient alternative to traditional machine learning methods.
arXiv Detail & Related papers (2024-01-25T09:17:19Z)
Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
Deep metric learning improves lab of origin prediction of genetically engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations. We propose a method, based on metric learning, that ranks the most likely labs-of-origin. We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z)
Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT. We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
A multi-stage machine learning model on diagnosis of esophageal manometry [50.591267188664666]
The framework includes deep-learning models at the swallow-level stage and feature-based machine learning models at the study-level stage. This is the first artificial-intelligence-style model to automatically predict CC diagnosis of HRM study from raw multi-swallow data.
arXiv Detail & Related papers (2021-06-25T20:09:23Z)
Data-Driven Logistic Regression Ensembles With Applications in Genomics [0.0]
We propose a new approach for dealing with high-dimensional binary classification problems that combines ideas from regularization and ensembling. We demonstrate the good performance of our method in terms of prediction accuracy and identification of key biomarkers using several medical datasets involving common diseases such as cancer, multiple sclerosis and psoriasis.
arXiv Detail & Related papers (2021-02-17T05:57:26Z)
G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers. We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z)
Mycorrhiza: Genotype Assignment usingPhylogenetic Networks [2.286041284499166]
We introduce Mycorrhiza, a machine learning approach for the genotype assignment problem. Our algorithm makes use of phylogenetic networks to engineer features that encode the evolutionary relationships among samples. Mycorrhiza yields particularly significant gains on datasets with a large average fixation index (FST) or deviation from the Hardy-Weinberg equilibrium.
arXiv Detail & Related papers (2020-10-14T02:36:27Z)
Low-Rank Reorganization via Proportional Hazards Non-negative Matrix Factorization Unveils Survival Associated Gene Clusters [9.773075235189525]
In this work, Cox proportional hazards regression is integrated with NMF by imposing survival constraints. Using human cancer gene expression data, the proposed technique can unravel critical clusters of cancer genes. The discovered gene clusters reflect rich biological implications and can help identify survival-related biomarkers.
arXiv Detail & Related papers (2020-08-09T17:59:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.