Disentangling Genotype and Environment Specific Latent Features for Improved Trait Prediction using a Compositional Autoencoder
- URL: http://arxiv.org/abs/2410.19922v1
- Date: Fri, 25 Oct 2024 18:30:27 GMT
- Title: Disentangling Genotype and Environment Specific Latent Features for Improved Trait Prediction using a Compositional Autoencoder
- Authors: Anirudha Powadi, Talukder Zaki Jubery, Michael C. Tross, James C. Schnable, Baskar Ganapathysubramanian
- Abstract summary: This study introduces a compositional autoencoder framework to improve trait prediction in plant breeding and genetics programs.
By disentangling latent features, the CAE provides a powerful tool for precision breeding and genetic research.
- Score: 1.137896937254823
- License:
- Abstract: This study introduces a compositional autoencoder (CAE) framework designed to disentangle the complex interplay between genotypic and environmental factors in high-dimensional phenotype data to improve trait prediction in plant breeding and genetics programs. Traditional predictive methods, which use compact representations of high-dimensional data through handcrafted features or latent features like PCA or, more recently, autoencoders, do not separate genotype-specific and environment-specific factors. We hypothesize that disentangling these features into genotype-specific and environment-specific components can enhance predictive models. To test this, we developed a compositional autoencoder (CAE) that decomposes high-dimensional data into distinct genotype-specific and environment-specific latent features. Our CAE framework employs a hierarchical architecture within an autoencoder to effectively separate these entangled latent features. Applied to a maize diversity panel dataset, the CAE demonstrates superior modeling of environmental influences and 5-10 times improved predictive performance for key traits like Days to Pollen and Yield, compared to traditional methods, including standard autoencoders, PCA with regression, and Partial Least Squares Regression (PLSR). By disentangling latent features, the CAE provides a powerful tool for precision breeding and genetic research. This work significantly enhances trait prediction models, advancing agricultural and biological sciences.
Related papers
- Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models [35.084222907099644]
We develop FREEFORM, Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling.
FREEFORM is available as an open-source framework on GitHub: https://github.com/PennShenLab/FREEFORM.
arXiv Detail & Related papers (2024-10-02T17:53:08Z)
- Generative Principal Component Regression via Variational Inference [2.4415762506639944]
One approach to designing appropriate manipulations is to target key features of predictive models.
We develop a novel objective based on supervised variational autoencoders (SVAEs) that enforces such information is represented in the latent space.
We show in simulations that gPCR dramatically improves target selection in manipulation as compared to standard PCR and SVAEs.
arXiv Detail & Related papers (2024-09-03T22:38:55Z)
- LSTM Autoencoder-based Deep Neural Networks for Barley Genotype-to-Phenotype Prediction [16.99449054451577]
We propose a new LSTM autoencoder-based model for barley genotype-to-phenotype prediction, specifically for flowering time and grain yield estimation.
Our model outperformed the other baseline methods, demonstrating its potential in handling complex high-dimensional agricultural datasets.
arXiv Detail & Related papers (2024-07-21T16:07:43Z)
- GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- Cancer Subtyping by Improved Transcriptomic Features Using Vector Quantized Variational Autoencoder [10.835673227875615]
We propose Vector Quantized Variational AutoEncoder (VQ-VAE) to tackle the data issues and extract informative latent features that are crucial to the quality of subsequent clustering.
VQ-VAE does not impose strict assumptions and hence its latent features are better representations of the input, capable of yielding superior clustering performance with any mainstream clustering method.
arXiv Detail & Related papers (2022-07-20T09:47:53Z)
- Differentiable Agent-based Epidemiology [71.81552021144589]
We introduce GradABM: a scalable, differentiable design for agent-based modeling that is amenable to gradient-based learning with automatic differentiation.
GradABM can quickly simulate million-size populations in a few seconds on commodity hardware, integrate with deep neural networks, and ingest heterogeneous data sources.
arXiv Detail & Related papers (2022-07-20T07:32:02Z)
- Deep Variational Models for Collaborative Filtering-based Recommender Systems [63.995130144110156]
Deep learning provides accurate collaborative filtering models to improve recommender system results.
Our proposed models apply the variational concept to inject stochasticity in the latent space of the deep architecture.
Results show the superiority of the proposed approach in scenarios where the variational enrichment exceeds the injected noise effect.
arXiv Detail & Related papers (2021-07-27T08:59:39Z)
- Mycorrhiza: Genotype Assignment using Phylogenetic Networks [2.286041284499166]
We introduce Mycorrhiza, a machine learning approach for the genotype assignment problem.
Our algorithm makes use of phylogenetic networks to engineer features that encode the evolutionary relationships among samples.
Mycorrhiza yields particularly significant gains on datasets with a large average fixation index (FST) or deviation from the Hardy-Weinberg equilibrium.
arXiv Detail & Related papers (2020-10-14T02:36:27Z)
- Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
- Repulsive Mixture Models of Exponential Family PCA for Clustering [127.90219303669006]
The mixture extension of exponential family principal component analysis (EPCA) was designed to encode much more structural information about data distribution than the traditional EPCA.
The traditional mixture of local EPCAs has the problem of model redundancy, i.e., overlaps among mixing components, which may cause ambiguity for data clustering.
In this paper, a repulsiveness-encouraging prior is introduced among mixing components and a diversified EPCA mixture (DEPCAM) model is developed in the Bayesian framework.
arXiv Detail & Related papers (2020-04-07T04:07:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.