Horizon-wise Learning Paradigm Promotes Gene Splicing Identification
- URL: http://arxiv.org/abs/2406.11900v1
- Date: Sat, 15 Jun 2024 08:18:09 GMT
- Title: Horizon-wise Learning Paradigm Promotes Gene Splicing Identification
- Authors: Qi-Jie Li, Qian Sun, Shao-Qun Zhang,
- Abstract summary: We propose a novel framework for the task of gene splicing identification, named Horizon-wise Gene Splicing Identification (H-GSI)
The proposed H-GSI follows the horizon-wise identification paradigm and comprises four components: the pre-processing procedure transforming string data into tensors, the sliding window technique handling long sequences, the SeqLab model, and the predictor.
In contrast to existing studies that process gene information with a truncated fixed-length sequence, H-GSI employs a horizon-wise identification paradigm in which all positions in a sequence are predicted with only one forward computation.
- Score: 6.225959701339916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Identifying gene splicing is a core and significant task confronted in modern collaboration between artificial intelligence and bioinformatics. Past decades have witnessed great efforts on this concern, such as the bio-plausible splicing pattern AT-CG and the famous SpliceAI. In this paper, we propose a novel framework for the task of gene splicing identification, named Horizon-wise Gene Splicing Identification (H-GSI). The proposed H-GSI follows the horizon-wise identification paradigm and comprises four components: the pre-processing procedure transforming string data into tensors, the sliding window technique handling long sequences, the SeqLab model, and the predictor. In contrast to existing studies that process gene information with a truncated fixed-length sequence, H-GSI employs a horizon-wise identification paradigm in which all positions in a sequence are predicted with only one forward computation, improving accuracy and efficiency. The experiments conducted on the real-world Human dataset show that our proposed H-GSI outperforms SpliceAI and achieves the best accuracy of 97.20\%. The source code is available from this link.
Related papers
- GenoTEX: A Benchmark for Evaluating LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians [13.837406082703756]
We introduce GenoTEX, a benchmark dataset for the automatic exploration of gene expression data.
GenoTEX provides annotated code and results for solving a wide range of gene identification problems.
We present GenoAgents, a team of LLM-based agents designed with context-aware planning, iterative correction, and domain expert consultation.
arXiv Detail & Related papers (2024-06-21T17:55:24Z) - Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification [119.13058298388101]
We develop a Biological-knowledge enhanced PathGenomic multi-label Transformer to improve genetic mutation prediction performances.
BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules.
BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules.
arXiv Detail & Related papers (2024-06-05T06:42:27Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - Histo-Genomic Knowledge Distillation For Cancer Prognosis From Histopathology Whole Slide Images [7.5123289730388825]
Genome-informed Hyper-Attention Network (G-HANet) is capable of effectively distilling histo-genomic knowledge during training.
Network comprises cross-modal associating branch (CAB) and hyper-attention survival branch (HSB)
arXiv Detail & Related papers (2024-03-15T06:20:09Z) - Efficient and Scalable Fine-Tune of Language Models for Genome
Understanding [49.606093223945734]
We present textscLingo: textscLanguage prefix ftextscIne-tuning for textscGentextscOmes.
Unlike DNA foundation models, textscLingo strategically leverages natural language foundation models' contextual cues.
textscLingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z) - DNA Sequence Classification with Compressors [0.0]
Our study introduces a novel adaptation of Jiang et al.'s compressor-based, parameter-free classification method, specifically tailored for DNA sequence analysis.
Not only does this method align with the current state-of-the-art in terms of accuracy, but it also offers a more resource-efficient alternative to traditional machine learning methods.
arXiv Detail & Related papers (2024-01-25T09:17:19Z) - Single-Cell Deep Clustering Method Assisted by Exogenous Gene
Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells.
During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation.
This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z) - Genetic InfoMax: Exploring Mutual Information Maximization in
High-Dimensional Imaging Genetics Studies [50.11449968854487]
Genome-wide association studies (GWAS) are used to identify relationships between genetic variations and specific traits.
Representation learning for imaging genetics is largely under-explored due to the unique challenges posed by GWAS.
We introduce a trans-modal learning framework Genetic InfoMax (GIM) to address the specific challenges of GWAS.
arXiv Detail & Related papers (2023-09-26T03:59:21Z) - Cancer-inspired Genomics Mapper Model for the Generation of Synthetic
DNA Sequences with Desired Genomics Signatures [0.0]
Cancer-inspired genomics mapper model (CGMM) combines genetic algorithm (GA) and deep learning (DL) methods.
We demonstrate that CGMM can generate synthetic genomes of selected phenotypes such as ancestry and cancer.
arXiv Detail & Related papers (2023-05-01T07:16:40Z) - Deep metric learning improves lab of origin prediction of genetically
engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations.
We propose a method, based on metric learning, that ranks the most likely labs-of-origin.
We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.