Natural language processing for clusterization of genes according to
their functions
- URL: http://arxiv.org/abs/2207.08162v1
- Date: Sun, 17 Jul 2022 12:59:34 GMT
- Title: Natural language processing for clusterization of genes according to
their functions
- Authors: Vladislav Dordiuk, Ekaterina Demicheva, Fernando Polanco Espino,
Konstantin Ushenin
- Abstract summary: We propose an approach that reduces the analysis of several thousand genes to the analysis of several clusters.
The descriptions are encoded as vectors using a pretrained language model (BERT) and several text processing approaches.
- Score: 62.997667081978825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There are hundreds of methods for analyzing data obtained from
mRNA sequencing, and most of them focus on a small number of genes. In this
study, we propose an approach that reduces the analysis of several thousand
genes to the analysis of several clusters. The list of genes is enriched with
information from open databases, and the resulting descriptions are encoded as
vectors using a pretrained language model (BERT) together with several text
processing approaches. The encoded gene functions then pass through
dimensionality reduction and clusterization. Aiming to find the most efficient
pipeline, we analyzed 180 pipeline variants that differ in the methods used at
the major pipeline steps. Performance was evaluated with clusterization
indexes and an expert review of the results.
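A minimal sketch of one such pipeline variant, assuming bert-base-uncased with mean pooling, PCA, and k-means (the study compares 180 combinations of choices at these steps; the gene descriptions below are hypothetical placeholders):
```python
# One illustrative pipeline variant: encode gene-function descriptions
# with a pretrained BERT, reduce dimensionality, and cluster. Model,
# reducer, and cluster count are assumptions, not the paper's selection.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

descriptions = {  # hypothetical gene -> database description mapping
    "TP53": "tumor suppressor; regulates cell cycle arrest and apoptosis",
    "BRCA1": "DNA double-strand break repair via homologous recombination",
    "MYC": "transcription factor driving proliferation and metabolism",
}

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

with torch.no_grad():
    batch = tok(list(descriptions.values()), padding=True, truncation=True,
                return_tensors="pt")
    out = bert(**batch).last_hidden_state          # (genes, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)   # ignore padding tokens
    emb = (out * mask).sum(1) / mask.sum(1)        # mean-pooled vectors

reduced = PCA(n_components=2).fit_transform(emb.numpy())
labels = KMeans(n_clusters=2, n_init=10).fit_predict(reduced)
print(dict(zip(descriptions, labels)))
```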
Related papers
- Robust Multi-view Co-expression Network Inference [8.697303234009528]
Inferring gene co-expression networks from transcriptome data presents many challenges.
We introduce a robust method for high-dimensional graph inference from multiple independent studies.
arXiv Detail & Related papers (2024-09-30T06:30:09Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - Efficient and Scalable Fine-Tune of Language Models for Genome
- Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tuning tasks via an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z) - SGC: A semi-supervised pipeline for gene clustering using self-training
- SGC: A semi-supervised pipeline for gene clustering using self-training approach in gene co-expression networks [3.8073142980733]
We propose a novel pipeline for gene clustering based on the mathematics of spectral network theory.
SGC consists of multiple novel steps that enable the computation of highly enriched modules in an unsupervised manner.
We show that SGC results in higher enrichment in real data.
arXiv Detail & Related papers (2022-09-21T14:51:08Z) - Comprehensive survey of computational learning methods for analysis of
- Comprehensive survey of computational learning methods for analysis of gene expression data in genomics [7.717214217542406]
Computational analysis methods including machine learning have a significant impact in the fields of genomics and medicine.
In this review, we compile various statistical and computational tools used in the analysis of expression microarray data.
We specifically discuss methods for missing value (gene expression) imputation, feature gene scaling, selection and extraction of features for dimensionality reduction, and learning and analysis of expression data.
arXiv Detail & Related papers (2022-02-07T05:53:13Z) - HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text
- HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in Rouge F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z) - Exploiting Language Model for Efficient Linguistic Steganalysis: An
- Exploiting Language Model for Efficient Linguistic Steganalysis: An Empirical Study [23.311007481830647]
We present two methods for efficient linguistic steganalysis.
One is to pre-train a language model based on RNN, and the other is to pre-train a sequence autoencoder.
arXiv Detail & Related papers (2021-07-26T12:37:18Z) - Rissanen Data Analysis: Examining Dataset Characteristics via
- Rissanen Data Analysis: Examining Dataset Characteristics via Description Length [78.42578316883271]
We introduce a method to determine if a certain capability helps to achieve an accurate model of given data.
Since minimum program length is uncomputable, we estimate the labels' minimum description length (MDL) as a proxy.
We call the method Rissanen Data Analysis (RDA) after the father of MDL.
arXiv Detail & Related papers (2021-03-05T18:58:32Z) - Mining Functionally Related Genes with Semi-Supervised Learning [0.0]
- Mining Functionally Related Genes with Semi-Supervised Learning [0.0]
We introduce a rich set of features and use them in conjunction with semi-supervised learning approaches.
The framework of learning with positive and unlabeled examples (LPU) is shown to be especially appropriate for mining functionally related genes.
arXiv Detail & Related papers (2020-11-05T20:34:09Z) - Generalized Matrix Factorization: efficient algorithms for fitting
- Generalized Matrix Factorization: efficient algorithms for fitting generalized linear latent variable models to large data arrays [62.997667081978825]
Generalized Linear Latent Variable Models (GLLVMs) generalize factor models to non-Gaussian responses.
Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets.
We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
arXiv Detail & Related papers (2020-10-06T04:28:19Z)