Assigning Species Information to Corresponding Genes by a Sequence
Labeling Framework
- URL: http://arxiv.org/abs/2205.03853v1
- Date: Sun, 8 May 2022 12:39:45 GMT
- Authors: Ling Luo, Chih-Hsuan Wei, Po-Ting Lai, Qingyu Chen, Rezarta Islamaj
Doğan, Zhiyong Lu
- Abstract summary: Existing methods typically rely on rules based on gene and species co-occurrence in the article.
We develop a high-performance method, using a novel deep learning-based framework, to classify whether there is a relation between a gene and a species.
Our benchmarking results show that our approach obtains significantly higher performance compared to that of the rule-based baseline method.
- Score: 7.231921004060877
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The automatic assignment of species information to the corresponding genes in
a research article is a critically important step in the gene normalization
task, whereby a gene mention is normalized and linked to a database record or
identifier by a text-mining algorithm. Existing methods typically rely on
heuristic rules based on gene and species co-occurrence in the article, but
their accuracy is suboptimal. We therefore developed a high-performance method,
using a novel deep learning-based framework, to classify whether there is a
relation between a gene and a species. Instead of the traditional binary
classification framework in which all possible pairs of genes and species in
the same article are evaluated, we treat the problem as a sequence-labeling
task such that only a fraction of the pairs needs to be considered. Our
benchmarking results show that our approach obtains significantly higher
performance compared to that of the rule-based baseline method for the species
assignment task (from 65.8% to 81.3% in accuracy). The source code and data for
species assignment are freely available at
https://github.com/ncbi/SpeciesAssignment.
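The difference between the two framings described in the abstract can be illustrated with a small counting sketch. This is not the authors' implementation: the example sentence, the `<sp>` marker convention, and both function names are assumptions made for illustration only. It contrasts a pairwise binary setup (one instance per gene-species pair) with an assumed sequence-labeling setup (one tagging pass per species mention, in which all gene tokens would be labeled at once).

```python
from itertools import product


def pairwise_instances(genes, species):
    """Binary framing: one classification instance per (gene, species) pair."""
    return list(product(genes, species))


def sequence_labeling_instances(tokens, species):
    """Assumed sequence-labeling framing: one tagging pass per species.

    The target species is wrapped in a hypothetical <sp> marker; a tagger
    would then label every gene token in the same pass.
    """
    instances = []
    for sp in species:
        marked = [f"<sp>{t}</sp>" if t == sp else t for t in tokens]
        instances.append((sp, marked))
    return instances


# Hypothetical example sentence with three gene mentions and two species mentions.
tokens = "Human BRCA1 and mouse Brca1 interact with TP53".split()
genes = ["BRCA1", "Brca1", "TP53"]
species = ["Human", "mouse"]

pairs = pairwise_instances(genes, species)
passes = sequence_labeling_instances(tokens, species)
print(len(pairs), len(passes))  # 6 pairwise instances vs. 2 tagging passes
```

Under these assumptions the number of model inputs grows with the number of species mentions rather than with the product of gene and species mentions, which is one way to read the abstract's remark that only a fraction of the pairs needs to be considered.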
Related papers
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
- Feature Selection via Robust Weighted Score for High Dimensional Binary Class-Imbalanced Gene Expression Data [1.2891210250935148]
A robust weighted score for unbalanced data (ROWSU) is proposed for selecting the most discriminative feature in high-dimensional, class-imbalanced binary classification of gene expression data.
The performance of the proposed ROWSU method is evaluated on 6 gene expression datasets.
arXiv Detail & Related papers (2024-01-23T11:22:03Z)
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
- Optirank: classification for RNA-Seq data with optimal ranking reference genes [0.0]
We propose a logistic regression model, optirank, which learns simultaneously the parameters of the model and the genes to use as a reference set in the ranking.
We also consider real classification tasks, which present different kinds of distribution shifts between train and test data.
arXiv Detail & Related papers (2023-01-11T10:49:06Z)
- Hierarchy exploitation to detect missing annotations on hierarchical multi-label classification [0.1749935196721634]
We present a method to detect missing annotations in hierarchical multi-label classification datasets.
We propose a method that exploits the class hierarchy by computing aggregated probabilities to the paths of classes from the leaves to the root for each instance.
The experiments on Oryza sativa Japonica, a variety of rice, show that incorporating the hierarchy of classes into the method often improves the predictive performance.
arXiv Detail & Related papers (2022-07-13T14:32:50Z)
- Multivariate feature ranking of gene expression data [62.997667081978825]
We propose two new multivariate feature ranking methods based on pairwise correlation and pairwise consistency.
We statistically prove that the proposed methods outperform the state-of-the-art feature ranking methods Clustering Variation, Chi Squared, Correlation, Information Gain, ReliefF and Significance.
arXiv Detail & Related papers (2021-11-03T17:19:53Z)
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
- Cancer Gene Profiling through Unsupervised Discovery [49.28556294619424]
We introduce a novel, automatic and unsupervised framework to discover low-dimensional gene biomarkers.
Our method is based on the LP-Stability algorithm, a high dimensional center-based unsupervised clustering algorithm.
Our signature reports promising results on distinguishing immune inflammatory and immune desert tumors.
arXiv Detail & Related papers (2021-02-11T09:04:45Z)
- Mining Functionally Related Genes with Semi-Supervised Learning [0.0]
We introduce a rich set of features and use them in conjunction with semisupervised learning approaches.
The framework of learning with positive and unlabeled examples (LPU) is shown to be especially appropriate for mining functionally related genes.
arXiv Detail & Related papers (2020-11-05T20:34:09Z)
- A New Gene Selection Algorithm using Fuzzy-Rough Set Theory for Tumor Classification [0.0]
We present a new technique for gene selection using a discernibility matrix of fuzzy-rough sets.
The proposed technique takes into account the similarity of those instances that have the same and different class labels to improve the gene selection results.
Experimental results demonstrate that this technique provides better efficiency compared to the state-of-the-art approaches.
arXiv Detail & Related papers (2020-03-26T13:43:25Z)