GoBERT: Gene Ontology Graph Informed BERT for Universal Gene Function Prediction
- URL: http://arxiv.org/abs/2501.01930v1
- Date: Fri, 03 Jan 2025 18:02:50 GMT
- Title: GoBERT: Gene Ontology Graph Informed BERT for Universal Gene Function Prediction
- Authors: Yuwei Miao, Yuzhi Guo, Hehuan Ma, Jingquan Yan, Feng Jiang, Rui Liao, Junzhou Huang,
- Abstract summary: We propose to tackle the gene function prediction problem by exploring Gene Ontology graph and annotation with BERT (GoBERT)
Two pre-train tasks are designed to jointly train GoBERT to capture both explicit and implicit relations of functions.
GoBERT possess the ability to predict novel functions for various gene and gene products based on known functional annotations.
- Score: 33.42820754967865
- License:
- Abstract: Exploring the functions of genes and gene products is crucial to a wide range of fields, including medical research, evolutionary biology, and environmental science. However, discovering new functions largely relies on expensive and exhaustive wet lab experiments. Existing methods of automatic function annotation or prediction mainly focus on protein function prediction with sequence, 3D-structures or protein family information. In this study, we propose to tackle the gene function prediction problem by exploring Gene Ontology graph and annotation with BERT (GoBERT) to decipher the underlying relationships among gene functions. Our proposed novel function prediction task utilizes existing functions as inputs and generalizes the function prediction to gene and gene products. Specifically, two pre-train tasks are designed to jointly train GoBERT to capture both explicit and implicit relations of functions. Neighborhood prediction is a self-supervised multi-label classification task that captures the explicit function relations. Specified masking and recovering task helps GoBERT in finding implicit patterns among functions. The pre-trained GoBERT possess the ability to predict novel functions for various gene and gene products based on known functional annotations. Extensive experiments, biological case studies, and ablation studies are conducted to demonstrate the superiority of our proposed GoBERT.
Related papers
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
The model adheres to the central dogma of molecular biology, accurately generating protein-coding sequences.
It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of promoter sequences.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification [119.13058298388101]
We develop a Biological-knowledge enhanced PathGenomic multi-label Transformer to improve genetic mutation prediction performances.
BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules.
BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules.
arXiv Detail & Related papers (2024-06-05T06:42:27Z) - BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments [112.25067497985447]
We introduce BioDiscoveryAgent, an agent that designs new experiments, reasons about their outcomes, and efficiently navigates the hypothesis space to reach desired solutions.
BioDiscoveryAgent can uniquely design new experiments without the need to train a machine learning model.
It achieves an average of 21% improvement in predicting relevant genetic perturbations across six datasets.
arXiv Detail & Related papers (2024-05-27T19:57:17Z) - FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics [46.189419603576084]
FGBERT is a novel metagenomic pre-trained model that employs a protein-based gene representation as a context-aware tokenizer.
It demonstrates superior performance on metagenomic datasets at four levels, spanning gene, functional, bacterial, and environmental levels.
arXiv Detail & Related papers (2024-02-24T13:13:17Z) - Graph-in-Graph Network for Automatic Gene Ontology Description
Generation [55.40404942182707]
We propose a novel task: GO term description generation.
This task aims to automatically generate a sentence that describes the function of a GO term belonging to one of three categories.
The proposed network introduces a two-layer graph: the first layer is a graph of GO terms where each node is also a graph (gene graph)
arXiv Detail & Related papers (2022-06-10T18:17:17Z) - Gene Function Prediction with Gene Interaction Networks: A Context Graph
Kernel Approach [24.234645183601998]
We propose to use a gene's context graph, i.e., the gene interaction network associated with the focal gene, to infer its functions.
In a kernel-based machine-learning framework, we design a context graph kernel to capture the information in context graphs.
arXiv Detail & Related papers (2022-04-22T02:54:01Z) - Feature extraction using Spectral Clustering for Gene Function
Prediction [0.4492444446637856]
This paper presents a novel in silico approach for to the annotation problem that combines cluster analysis and hierarchical multi-label classification.
The proposed approach is applied to a case study on Zea mays, one of the most dominant and productive crops in the world.
arXiv Detail & Related papers (2022-03-25T10:17:36Z) - Mining Functionally Related Genes with Semi-Supervised Learning [0.0]
We introduce a rich set of features and use them in conjunction with semisupervised learning approaches.
The framework of learning with positive and unlabeled examples (LPU) is shown to be especially appropriate for mining functionally related genes.
arXiv Detail & Related papers (2020-11-05T20:34:09Z) - Handling highly correlated genes in prediction analysis of genomic
studies [0.0]
High correlation among genes introduces technical problems, such as multi-collinearity issues, leading to unreliable prediction models.
We propose a grouping algorithm, which treats highly correlated genes as a group and uses their common pattern to represent the group's biological signal in feature selection.
Our proposed grouping method has two advantages. First, using the gene group's common patterns makes the prediction more robust and reliable under condition change.
arXiv Detail & Related papers (2020-07-05T22:14:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.