Generalising sequence models for epigenome predictions with tissue and
assay embeddings
- URL: http://arxiv.org/abs/2308.11671v1
- Date: Tue, 22 Aug 2023 10:34:19 GMT
- Title: Generalising sequence models for epigenome predictions with tissue and
assay embeddings
- Authors: Jacob Deasy, Ron Schwessinger, Ferran Gonzalez, Stephen Young, Kim
Branson
- Abstract summary: We show that strong correlation can be achieved across a large range of experimental conditions by integrating tissue and assay embeddings into a Contextualised Genomic Network (CGN).
We exhibit the efficacy of our approach across a broad set of epigenetic profiles and provide the first insights into the effect of genetic variants on epigenetic sequence model training.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Sequence modelling approaches for epigenetic profile prediction have recently
expanded in terms of sequence length, model size, and profile diversity.
However, current models cannot infer on many experimentally feasible tissue and
assay pairs due to poor usage of contextual information, limiting $\textit{in
silico}$ understanding of regulatory genomics. We demonstrate that strong
correlation can be achieved across a large range of experimental conditions by
integrating tissue and assay embeddings into a Contextualised Genomic Network
(CGN). In contrast to previous approaches, we enhance long-range sequence
embeddings with contextual information in the input space, rather than
expanding the output space. We exhibit the efficacy of our approach across a
broad set of epigenetic profiles and provide the first insights into the effect
of genetic variants on epigenetic sequence model training. Our general approach
to context integration exceeds state of the art in multiple settings while
employing a more rigorous validation procedure.
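The abstract's central design choice, adding context in the input space rather than expanding the output space, can be illustrated with a toy sketch. This is not the authors' CGN architecture, which is not described on this page. The dimensions, tissue and assay names, lookup tables, and the `predict` function are all hypothetical, and the weights are random rather than trained.

```python
import random

random.seed(0)

SEQ_DIM = 8  # hypothetical width of a per-position sequence embedding
CTX_DIM = 4  # hypothetical width of each context embedding

TISSUES = ["liver", "heart", "lung"]       # illustrative names only
ASSAYS = ["DNase-seq", "H3K27ac", "CTCF"]  # illustrative names only


def random_vector(n):
    """Stand-in for a learned parameter vector (random, not trained)."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]


# One embedding per tissue and one per assay, not one per (tissue, assay) pair.
tissue_table = {t: random_vector(CTX_DIM) for t in TISSUES}
assay_table = {a: random_vector(CTX_DIM) for a in ASSAYS}

# A single shared output head; its size is independent of the number of pairs.
head_weights = random_vector(SEQ_DIM + 2 * CTX_DIM)


def predict(seq_embeddings, tissue, assay):
    """Return one predicted signal value per sequence position."""
    context = tissue_table[tissue] + assay_table[assay]  # list concatenation
    track = []
    for position in seq_embeddings:
        features = position + context  # context enters on the input side
        track.append(sum(w * x for w, x in zip(head_weights, features)))
    return track


# Five positions of (random) sequence embeddings stand in for a long-range encoder.
seq_embeddings = [random_vector(SEQ_DIM) for _ in range(5)]

# Any tissue can be combined with any assay, including pairs never seen together.
track = predict(seq_embeddings, "lung", "CTCF")
other = predict(seq_embeddings, "liver", "CTCF")
print(len(track), track != other)
```

In this sketch the model stores 3 + 3 context embeddings and one shared head, so it can be queried for all 9 tissue and assay combinations. An output-space design would instead need a separate output channel for each pair and could only predict pairs that had a channel at training time.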
Related papers
- A Non-negative VAE: the Generalized Gamma Belief Network [49.970917207211556]
The gamma belief network (GBN) has demonstrated its potential for uncovering multi-layer interpretable latent representations in text data.
We introduce the generalized gamma belief network (Generalized GBN) in this paper, which extends the original linear generative model to a more expressive non-linear generative model.
We also propose an upward-downward Weibull inference network to approximate the posterior distribution of the latent variables.
arXiv Detail & Related papers (2024-08-06T18:18:37Z)
- U-learning for Prediction Inference via Combinatory Multi-Subsampling: With Applications to LASSO and Neural Networks [5.587500517608073]
Epigenetic aging clocks play a pivotal role in estimating an individual's biological age through the examination of DNA methylation patterns.
We introduce a novel U-sampling approach via multi-sublearning for making ensemble predictions.
More specifically, our approach conceptualizes the ensemble estimators within the framework of generalized U-statistics.
We apply our approach to two commonly used predictive algorithms, Lasso and deep neural networks (DNNs), and illustrate the validity of inferences with extensive numerical studies.
arXiv Detail & Related papers (2024-07-22T00:03:51Z)
- Generating Multi-Modal and Multi-Attribute Single-Cell Counts with CFGen [76.02070962797794]
We present Cell Flow for Generation, a flow-based conditional generative model for multi-modal single-cell counts.
Our results suggest improved recovery of crucial biological data characteristics while accounting for novel generative tasks.
arXiv Detail & Related papers (2024-07-16T14:05:03Z)
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z)
- GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z)
- Heterogeneous Transfer Learning for Building High-Dimensional Generalized Linear Models with Disparate Datasets [0.0]
We describe a transfer learning approach for building high-dimensional generalized linear models.
We use data from a main study with detailed information on all predictors and an external, potentially much larger, study that has a more limited set of predictors.
arXiv Detail & Related papers (2023-12-20T06:11:59Z)
- Mutual Exclusivity Training and Primitive Augmentation to Induce Compositionality [84.94877848357896]
Recent datasets expose the lack of the systematic generalization ability in standard sequence-to-sequence models.
We analyze this behavior of seq2seq models and identify two contributing factors: a lack of mutual exclusivity bias and the tendency to memorize whole examples.
We show substantial empirical improvements using standard sequence-to-sequence models on two widely-used compositionality datasets.
arXiv Detail & Related papers (2022-11-28T17:36:41Z)
- Multi-modality fusion using canonical correlation analysis methods: Application in breast cancer survival prediction from histology and genomics [16.537929113715432]
We study the use of canonical correlation analysis (CCA) and penalized variants of CCA for the fusion of two modalities.
We analytically show that, with known model parameters, posterior mean estimators that jointly use both modalities outperform arbitrary linear mixing of single modality posterior estimators in latent variable prediction.
arXiv Detail & Related papers (2021-11-27T21:18:01Z)