Abundance-Aware Set Transformer for Microbiome Sample Embedding
- URL: http://arxiv.org/abs/2508.11075v1
- Date: Thu, 14 Aug 2025 21:15:53 GMT
- Title: Abundance-Aware Set Transformer for Microbiome Sample Embedding
- Authors: Hyunwoo Yoo, Gail Rosen
- Abstract summary: We propose an abundance-aware variant of the Set Transformer to construct fixed-size sample-level embeddings. Our method outperforms average pooling and unweighted Set Transformers on real-world microbiome classification tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Representing microbiome samples as inputs to LLMs is essential for downstream tasks such as phenotype prediction and environmental classification. While prior studies have explored embedding-based representations of each microbiome sample, most rely on simple averaging over sequence embeddings, often overlooking the biological importance of taxa abundance. In this work, we propose an abundance-aware variant of the Set Transformer to construct fixed-size sample-level embeddings by weighting sequence embeddings according to their relative abundance. Without modifying the model architecture, we replicate embedding vectors in proportion to their abundance and apply self-attention-based aggregation. Our method outperforms average pooling and unweighted Set Transformers on real-world microbiome classification tasks, achieving perfect performance in some cases. These results demonstrate the utility of abundance-aware aggregation for robust and biologically informed microbiome representation. To the best of our knowledge, this is one of the first approaches to integrate sequence-level abundance into Transformer-based sample embeddings.
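The abstract's core trick, replicating each sequence embedding in proportion to its relative abundance and then aggregating the set with attention, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `attention_pool` function stands in for a full Set Transformer's pooling-by-attention block, and the single random `seed` query plays the role of a learned seed vector.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def replicate_by_abundance(seq_emb, abundance, total=100):
    """Repeat each sequence embedding in proportion to its relative abundance,
    so frequent taxa contribute more elements to the input set."""
    rel = np.asarray(abundance, dtype=float)
    rel = rel / rel.sum()
    counts = np.maximum(1, np.round(rel * total)).astype(int)
    return np.repeat(seq_emb, counts, axis=0)

def attention_pool(set_emb, seed):
    """Single-head attention pooling: a seed query attends over the set and
    returns one fixed-size vector (a stand-in for Set Transformer PMA)."""
    scores = set_emb @ seed / np.sqrt(seed.shape[0])
    weights = softmax(scores)
    return weights @ set_emb

# Toy usage: 3 taxa with 4-dimensional sequence embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 4))
sample_set = replicate_by_abundance(emb, [0.7, 0.2, 0.1], total=10)
seed = rng.normal(size=4)
sample_vec = attention_pool(sample_set, seed)  # fixed-size sample embedding
```

Because the replication happens at the input, the aggregation model itself is unchanged; abundance information enters purely through how often each embedding appears in the set.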
Related papers
- Large-scale EM Benchmark for Multi-Organelle Instance Segmentation in the Wild [8.670858548670742]
We develop a benchmark for multi-organelle instance segmentation, comprising over 100,000 2D EM images across a variety of cell types and five organelle classes that capture real-world variability. Our results show several limitations: current models struggle to generalize across heterogeneous EM data and perform poorly on organelles with global, distributed morphologies. These findings underscore the fundamental mismatch between local-context models and the challenge of modeling long-range structural continuity in the presence of real-world variability.
arXiv Detail & Related papers (2026-01-18T16:09:27Z) - MoRE: Batch-Robust Multi-Omics Representations from Frozen Pre-trained Transformers [0.0]
We present MoRE (Multi-Omics Representation Embedding), a framework that repurposes frozen pre-trained transformers to align heterogeneous assays into a shared latent space. Specifically, MoRE attaches lightweight, modality-specific adapters and a task-adaptive fusion layer to the frozen backbone. We benchmark MoRE against established baselines, including scGPT, scVI, and Harmony with Scrublet, evaluating integration fidelity, rare population detection, and modality transfer.
arXiv Detail & Related papers (2025-11-25T15:04:06Z) - Bidirectional Mamba for Single-Cell Data: Efficient Context Learning with Biological Fidelity [0.39945675027960637]
We introduce GeneMamba, a scalable and efficient foundation model for single-cell transcriptomics built on state space modeling. GeneMamba captures bidirectional gene context with linear-time complexity, offering substantial computational gains over transformer baselines. We evaluate GeneMamba across diverse tasks, including multi-batch integration, cell type annotation, and gene-gene correlation, demonstrating strong performance, interpretability, and robustness.
arXiv Detail & Related papers (2025-04-22T20:34:47Z) - Hierarchical Sparse Bayesian Multitask Model with Scalable Inference for Microbiome Analysis [1.361248247831476]
This paper proposes a hierarchical Bayesian multitask learning model that is applicable to the general multi-task binary classification learning problem. We derive a computationally efficient inference algorithm based on variational inference to approximate the posterior distribution. We demonstrate the potential of the new approach on various synthetic datasets and for predicting human health status based on microbiome profiles.
arXiv Detail & Related papers (2025-02-04T18:23:22Z) - Constructing Cell-type Taxonomy by Optimal Transport with Relaxed Marginal Constraints [14.831346286039151]
One challenge in the cluster analysis of cells is matching clusters extracted from datasets of different origins or conditions. Our approach aims to construct a taxonomy for cell clusters across all samples to better annotate these clusters and effectively extract features for downstream analysis.
arXiv Detail & Related papers (2025-01-29T21:29:25Z) - Vision Transformers for Weakly-Supervised Microorganism Enumeration [0.0]
This study conducts a comparative analysis of vision transformers (ViTs) for weakly-supervised counting in microorganism enumeration. We trained different versions of ViTs as the architectural backbone for feature extraction using four microbiology datasets. Results indicate that while ResNets perform better overall, ViTs demonstrate competitive results across all datasets.
arXiv Detail & Related papers (2024-12-03T08:27:20Z) - Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning [78.38442423223832]
We develop a novel codebook pre-training task, namely masked microenvironment modeling.
We demonstrate superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction.
arXiv Detail & Related papers (2024-05-16T03:53:21Z) - Learning Multiscale Consistency for Self-supervised Electron Microscopy Instance Segmentation [48.267001230607306]
We propose a pretraining framework that enhances multiscale consistency in EM volumes.
Our approach leverages a Siamese network architecture, integrating strong and weak data augmentations.
It effectively captures voxel and feature consistency, showing promise for learning transferable representations for EM analysis.
arXiv Detail & Related papers (2023-08-19T05:49:13Z) - Energy-Based Test Sample Adaptation for Domain Generalization [81.04943285281072]
We propose energy-based sample adaptation at test time for domain generalization.
To adapt target samples to source distributions, we iteratively update the samples by energy minimization.
Experiments on six benchmarks for classification of images and microblog threads demonstrate the effectiveness of our proposal.
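The iterative update described above, moving a test sample toward low-energy (source-like) regions, can be sketched with gradient descent on an energy function. This is a hypothetical toy illustration, not the paper's method: the quadratic energy and its gradient are assumed here purely so the update loop is concrete.

```python
import numpy as np

def adapt_sample(x, energy_grad, steps=20, lr=0.1):
    """Iteratively update a test sample by descending an energy function,
    pulling it toward the source distribution before classification."""
    x = x.copy()
    for _ in range(steps):
        x = x - lr * energy_grad(x)
    return x

# Toy energy E(x) = 0.5 * ||x - mu||^2 around an assumed source mean mu,
# whose gradient is simply (x - mu).
mu = np.zeros(2)
energy_grad = lambda x: x - mu
x0 = np.array([3.0, -2.0])
x_adapted = adapt_sample(x0, energy_grad)  # moves toward mu
```

In the actual paper the energy comes from a model of the source distributions; the toy quadratic above only demonstrates the shape of the test-time minimization loop.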
arXiv Detail & Related papers (2023-02-22T08:55:09Z) - Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z) - Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel Machine Learning architecture, which allows us to infuse a neural deep network with human-powered abstraction on the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z) - Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z) - GANs with Variational Entropy Regularizers: Applications in Mitigating the Mode-Collapse Issue [95.23775347605923]
Building on the success of deep learning, Generative Adversarial Networks (GANs) provide a modern approach to learn a probability distribution from observed samples.
GANs often suffer from the mode collapse issue where the generator fails to capture all existing modes of the input distribution.
We take an information-theoretic approach and maximize a variational lower bound on the entropy of the generated samples to increase their diversity.
arXiv Detail & Related papers (2020-09-24T19:34:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.