InfoSEM: A Deep Generative Model with Informative Priors for Gene Regulatory Network Inference
- URL: http://arxiv.org/abs/2503.04483v1
- Date: Thu, 06 Mar 2025 14:32:00 GMT
- Title: InfoSEM: A Deep Generative Model with Informative Priors for Gene Regulatory Network Inference
- Authors: Tianyu Cui, Song-Jun Xu, Artem Moskalev, Shuwei Li, Tommaso Mansi, Mangal Prakash, Rui Liao
- Abstract summary: Inferring Gene Regulatory Networks (GRNs) from gene expression data is crucial for understanding biological processes.
We introduce InfoSEM, an unsupervised generative model that leverages textual gene embeddings as informative priors.
We propose a biologically motivated benchmarking framework that better reflects real-world applications such as biomarker discovery.
- Score: 6.17096244556794
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inferring Gene Regulatory Networks (GRNs) from gene expression data is crucial for understanding biological processes. While supervised models are reported to achieve high performance for this task, they rely on costly ground truth (GT) labels and risk learning gene-specific biases, such as class imbalances of GT interactions, rather than true regulatory mechanisms. To address these issues, we introduce InfoSEM, an unsupervised generative model that leverages textual gene embeddings as informative priors, improving GRN inference without GT labels. InfoSEM can also integrate GT labels as an additional prior when available, avoiding biases and further enhancing performance. Additionally, we propose a biologically motivated benchmarking framework that better reflects real-world applications such as biomarker discovery and reveals learned biases of existing supervised methods. InfoSEM outperforms existing models by 38.5% across four datasets using the textual embedding prior and further boosts performance by 11.1% when integrating labeled data as priors.
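To make the modeling idea concrete, the following is a minimal sketch, not the authors' implementation: a linear structural equation model (SEM) for GRN inference in which each candidate edge weight receives a zero-mean Gaussian prior whose scale is derived from the cosine similarity of textual gene embeddings, so embedding-similar gene pairs are allowed larger regulatory weights. The synthetic data, the prior form, and the plain gradient-descent fit are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative sketch only: linear SEM with an embedding-informed Gaussian prior.
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_cells, emb_dim = 20, 500, 64

# Synthetic ground-truth network and expression data: X = X @ W_true + noise.
W_true = np.zeros((n_genes, n_genes))
W_true[0, 1] = 0.8                                   # gene 0 activates gene 1 (toy example)
W_true[2, 3] = -0.6                                  # gene 2 represses gene 3 (toy example)
noise = rng.standard_normal((n_cells, n_genes))
X = noise @ np.linalg.inv(np.eye(n_genes) - W_true)

# Textual gene embeddings (assumed given, e.g. from a language model over gene descriptions).
E = rng.standard_normal((n_genes, emb_dim))
E = E / np.linalg.norm(E, axis=1, keepdims=True)
sim = E @ E.T                                        # cosine similarity in [-1, 1]
prior_scale = 0.1 + 0.45 * (1.0 + sim)               # larger scale = weaker shrinkage on that edge

mask = 1.0 - np.eye(n_genes)                         # disallow self-loops

# MAP estimate of W under 0.5*||X - XW||^2 + 0.5*sum((W / prior_scale)^2),
# fitted with plain gradient descent as a stand-in for the paper's variational inference.
W, lr = np.zeros((n_genes, n_genes)), 1e-4
for _ in range(300):
    grad = -(X.T @ (X - X @ W)) + W / prior_scale ** 2
    W = (W - lr * grad) * mask

i, j = np.unravel_index(np.argmax(np.abs(W)), W.shape)
print(f"strongest inferred edge: gene {i} -> gene {j}, weight {W[i, j]:.2f}")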
Related papers
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.
It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- GFM-RAG: Graph Foundation Model for Retrieval Augmented Generation [84.41557981816077]
We introduce GFM-RAG, a novel graph foundation model (GFM) for retrieval augmented generation.
GFM-RAG is powered by an innovative graph neural network that reasons over graph structure to capture complex query-knowledge relationships.
It achieves state-of-the-art performance while maintaining efficiency and alignment with neural scaling laws.
arXiv Detail & Related papers (2025-02-03T07:04:29Z)
- Gene Regulatory Network Inference in the Presence of Selection Bias and Latent Confounders [14.626706466908386]
Gene Regulatory Network Inference (GRNI) aims to identify causal relationships among genes using gene expression data.
Gene expression is influenced by latent confounders, such as non-coding RNAs, which add complexity to GRNI.
We propose GISL (Gene Regulatory Network Inference in the presence of Selection bias and Latent confounders) to infer true regulatory relationships in the presence of selection and confounding issues.
arXiv Detail & Related papers (2025-01-17T11:27:58Z)
- Knowledge-Guided Biomarker Identification for Label-Free Single-Cell RNA-Seq Data: A Reinforcement Learning Perspective [24.247247851943982]
We present an iterative gene panel selection strategy that harnesses ensemble knowledge from existing gene selection algorithms to establish preliminary boundaries or prior knowledge.
We incorporate reinforcement learning through a reward function shaped by expert behavior, enabling dynamic refinement and targeted selection of gene panels.
Our results underscore the potential of this approach to advance single-cell genomics data analysis.
arXiv Detail & Related papers (2025-01-02T07:57:41Z)
- UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models [88.16197692794707]
UniGen is a comprehensive framework designed to produce diverse, accurate, and highly controllable datasets.
To augment data diversity, UniGen incorporates an attribute-guided generation module and a group checking feature.
Extensive experiments demonstrate the superior quality of data generated by UniGen.
arXiv Detail & Related papers (2024-06-27T07:56:44Z)
- Horizon-wise Learning Paradigm Promotes Gene Splicing Identification [6.225959701339916]
We propose a novel framework for the task of gene splicing identification, named Horizon-wise Gene Splicing Identification (H-GSI).
The proposed H-GSI follows the horizon-wise identification paradigm and comprises four components: the pre-processing procedure transforming string data into tensors, the sliding window technique handling long sequences, the SeqLab model, and the predictor.
In contrast to existing studies that process gene information with a truncated fixed-length sequence, H-GSI employs a horizon-wise identification paradigm in which all positions in a sequence are predicted with only one forward computation.
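A rough sketch of that horizon-wise idea, under assumed details: a long sequence is cut into overlapping windows and every position within a window is scored in a single forward pass. The window size, stride, and toy per-position scorer below are placeholders, not components of H-GSI.

```python
# Illustrative sketch only: per-position prediction over sliding windows,
# so all positions in a window are scored in one forward computation.
import numpy as np

rng = np.random.default_rng(0)
BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as a (length, 4) one-hot array."""
    idx = np.array([BASES.index(b) for b in seq])
    return np.eye(4)[idx]

def score_window(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Toy per-position model: one logistic score per position, in one pass."""
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def predict_long_sequence(seq: str, w: np.ndarray, window: int = 128, stride: int = 64) -> np.ndarray:
    """Slide overlapping windows over a long sequence and average scores where
    windows overlap, yielding one probability per position."""
    scores = np.zeros(len(seq))
    counts = np.zeros(len(seq))
    starts = list(range(0, max(len(seq) - window, 0) + 1, stride))
    if starts[-1] + window < len(seq):
        starts.append(len(seq) - window)          # make sure the tail is covered
    for start in starts:
        end = min(start + window, len(seq))
        x = one_hot(seq[start:end])
        scores[start:end] += score_window(x, w)
        counts[start:end] += 1
    return scores / np.maximum(counts, 1)

seq = "".join(rng.choice(list(BASES), size=1000))
w = rng.standard_normal(4)                        # placeholder parameters
probs = predict_long_sequence(seq, w)
print(probs.shape, probs[:5].round(3))
```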
arXiv Detail & Related papers (2024-06-15T08:18:09Z)
- Bt-GAN: Generating Fair Synthetic Healthdata via Bias-transforming Generative Adversarial Networks [3.3903891679981593]
We present Bias-transforming Generative Adversarial Networks (Bt-GAN), a GAN-based synthetic data generator specifically designed for the healthcare domain.
Our results demonstrate that Bt-GAN achieves SOTA accuracy while significantly improving fairness and minimizing bias.
arXiv Detail & Related papers (2024-04-21T12:16:38Z)
- Exploring Sparsity in Graph Transformers [67.48149404841925]
Graph Transformers (GTs) have achieved impressive results on various graph-related tasks.
However, the huge computational cost of GTs hinders their deployment and application, especially in resource-constrained environments.
We propose a comprehensive Graph Transformer SParsification (GTSP) framework that helps reduce the computational complexity of GTs.
arXiv Detail & Related papers (2023-12-09T06:21:44Z)
- Biomedical knowledge graph-optimized prompt generation for large language models [1.6658478064349376]
Large Language Models (LLMs) are being adopted at an unprecedented rate, yet still face challenges in knowledge-intensive domains like biomedicine.
Here, we introduce a token-optimized and robust Knowledge Graph-based Retrieval Augmented Generation framework.
arXiv Detail & Related papers (2023-11-29T03:07:00Z)
- Genetic InfoMax: Exploring Mutual Information Maximization in High-Dimensional Imaging Genetics Studies [50.11449968854487]
Genome-wide association studies (GWAS) are used to identify relationships between genetic variations and specific traits.
Representation learning for imaging genetics is largely under-explored due to the unique challenges posed by GWAS.
We introduce a trans-modal learning framework Genetic InfoMax (GIM) to address the specific challenges of GWAS.
arXiv Detail & Related papers (2023-09-26T03:59:21Z)
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.