Genomic Interpreter: A Hierarchical Genomic Deep Neural Network with 1D
Shifted Window Transformer
- URL: http://arxiv.org/abs/2306.05143v2
- Date: Wed, 28 Jun 2023 08:17:32 GMT
- Title: Genomic Interpreter: A Hierarchical Genomic Deep Neural Network with 1D
Shifted Window Transformer
- Authors: Zehui Li, Akashaditya Das, William A V Beardall, Yiren Zhao, Guy-Bart
Stan
- Abstract summary: Genomic Interpreter is a novel architecture for genomic assay prediction.
Model can identify hierarchical dependencies in genomic sites.
Evaluated on a dataset containing 38,171 DNA segments of 17K pairs.
- Score: 4.059849656394191
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given the increasing volume and quality of genomics data, extracting new
insights requires interpretable machine-learning models. This work presents
Genomic Interpreter: a novel architecture for genomic assay prediction. This
model outperforms the state-of-the-art models for genomic assay prediction
tasks. Our model can identify hierarchical dependencies in genomic sites. This
is achieved through the integration of 1D-Swin, a novel Transformer-based block
designed by us for modelling long-range hierarchical data. Evaluated on a
dataset containing 38,171 DNA segments of 17K base pairs, Genomic Interpreter
demonstrates superior performance in chromatin accessibility and gene
expression prediction and unmasks the underlying `syntax' of gene regulation.
Related papers
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z) - GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics [35.47381119898764]
We introduce a protein-based gene representation as a context-aware and structure-relevant tokenizer.
MGM and TEM-CL constitute our novel metagenomic language model NAME, pre-trained on 100 million metagenomic sequences.
arXiv Detail & Related papers (2024-02-24T13:13:17Z) - Efficient and Scalable Fine-Tune of Language Models for Genome
Understanding [49.606093223945734]
We present textscLingo: textscLanguage prefix ftextscIne-tuning for textscGentextscOmes.
Unlike DNA foundation models, textscLingo strategically leverages natural language foundation models' contextual cues.
textscLingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z) - Self-Supervised Graph Representation Learning for Neuronal Morphologies [75.38832711445421]
We present GraphDINO, a data-driven approach to learn low-dimensional representations of 3D neuronal morphologies from unlabeled datasets.
We show, in two different species and across multiple brain areas, that this method yields morphological cell type clusterings on par with manual feature-based classification by experts.
Our method could potentially enable data-driven discovery of novel morphological features and cell types in large-scale datasets.
arXiv Detail & Related papers (2021-12-23T12:17:47Z) - Epigenomic language models powered by Cerebras [0.0]
Epigenomic BERT (or EBERT) learns representations based on both DNA sequence and paired epigenetic state inputs.
We show EBERT's transfer learning potential by demonstrating strong performance on a cell type-specific transcription factor binding prediction task.
Our fine-tuned model exceeds state of the art performance on 4 of 13 evaluation datasets from ENCODE-DREAM benchmarks and earns an overall rank of 3rd on the challenge leaderboard.
arXiv Detail & Related papers (2021-12-14T17:23:42Z) - Multi-modal Self-supervised Pre-training for Regulatory Genome Across
Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.