HypoGeneAgent: A Hypothesis Language Agent for Gene-Set Cluster Resolution Selection Using Perturb-seq Datasets
- URL: http://arxiv.org/abs/2509.09740v1
- Date: Wed, 10 Sep 2025 22:25:33 GMT
- Title: HypoGeneAgent: A Hypothesis Language Agent for Gene-Set Cluster Resolution Selection Using Perturb-seq Datasets
- Authors: Ying Yuan, Xing-Yue Monica Ge, Aaron Archer Waterman, Tommaso Biancalani, David Richmond, Yogesh Pandit, Avtar Singh, Russell Littman, Jin Liu, Jan-Christian Huetter, Vladimir Ermakov,
- Abstract summary: Large-scale single-cell and Perturb-seq investigations routinely involve clustering cells.<n>We present HYPOGENEAGENT, a large language model (LLM)-driven framework, transforming cluster annotation into a quantitatively optimizable task.
- Score: 9.465853880657487
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale single-cell and Perturb-seq investigations routinely involve clustering cells and subsequently annotating each cluster with Gene-Ontology (GO) terms to elucidate the underlying biological programs. However, both stages, resolution selection and functional annotation, are inherently subjective, relying on heuristics and expert curation. We present HYPOGENEAGENT, a large language model (LLM)-driven framework, transforming cluster annotation into a quantitatively optimizable task. Initially, an LLM functioning as a gene-set analyst analyzes the content of each gene program or perturbation module and generates a ranked list of GO-based hypotheses, accompanied by calibrated confidence scores. Subsequently, we embed every predicted description with a sentence-embedding model, compute pair-wise cosine similarities, and let the agent referee panel score (i) the internal consistency of the predictions, high average similarity within the same cluster, termed intra-cluster agreement (ii) their external distinctiveness, low similarity between clusters, termed inter-cluster separation. These two quantities are combined to produce an agent-derived resolution score, which is maximized when clusters exhibit simultaneous coherence and mutual exclusivity. When applied to a public K562 CRISPRi Perturb-seq dataset as a preliminary test, our Resolution Score selects clustering granularities that exhibit alignment with known pathway compared to classical metrics such silhouette score, modularity score for gene functional enrichment summary. These findings establish LLM agents as objective adjudicators of cluster resolution and functional annotation, thereby paving the way for fully automated, context-aware interpretation pipelines in single-cell multi-omics studies.
Related papers
- GALACTIC: Global and Local Agnostic Counterfactuals for Time-series Clustering [9.33534873952683]
This paper introduces GALACTIC, the first unified framework to bridge local and global counterfactual explainability for unsupervised time-series clustering.<n>At instance level (local), GALACTIC generates perturbations via a cluster-aware optimization objective.<n>At cluster level (global), to mitigate cognitive load and enhance interpretability, we formulate a representative CE selection problem.
arXiv Detail & Related papers (2026-03-05T15:59:33Z) - Hierarchical Clustering With Confidence [6.4793198569929356]
Agglomerative hierarchical clustering is highly sensitive to small perturbations in the data.<n>We show how randomizing hierarchical clustering can be useful not just for measuring stability but also for designing valid hypothesis testing procedures.
arXiv Detail & Related papers (2025-12-06T18:18:20Z) - Parameter-Free Clustering via Self-Supervised Consensus Maximization (Extended Version) [50.41628860536753]
We propose a novel and fully parameter-free clustering framework via Self-supervised Consensus Maximization, named SCMax.<n>Our framework performs hierarchical agglomerative clustering and cluster evaluation in a single, integrated process.
arXiv Detail & Related papers (2025-11-12T11:17:17Z) - Reliable data clustering with Bayesian community detection [0.0]
Researchers rely on clustering similarity data to uncover modular structure.<n>Yet widely used clustering methods, such as hierarchical clustering, k-means, and WGCNA, lack principled model selection, leaving them susceptible to noise.<n>A common workaround sparsifies a correlation matrix representation to remove noise before clustering, but this extra step introduces arbitrary thresholds that can distort the structure and lead to unreliable results.
arXiv Detail & Related papers (2025-10-16T14:10:24Z) - GOLFS: Feature Selection via Combining Both Global and Local Information for High Dimensional Clustering [10.740524877905685]
We propose a new unsupervised feature selection method, named GlObal and Local information combined Feature Selection (GOLFS)<n>GOLFS combines both local geometric structure via manifold learning and global correlation structure of samples to select the discriminative features.<n>The combination improves the accuracy of both feature selection and clustering by exploiting more comprehensive information.
arXiv Detail & Related papers (2025-07-15T03:39:07Z) - scGHSOM: Hierarchical clustering and visualization of single-cell and CRISPR data using growing hierarchical SOM [0.8452349885923507]
We propose a comprehensive gene-cell dependency visualization via unsupervised clustering, Growing Hierarchical Self-Organizing Map (GHSOM)
GHSOM is applied to cluster samples in a hierarchical structure such that the self-growth structure of clusters satisfies the required variations between and within.
We present two innovative visualization tools: Cluster Feature Map and Cluster Distribution Map.
arXiv Detail & Related papers (2024-07-24T04:01:09Z) - Interpetable Target-Feature Aggregation for Multi-Task Learning based on Bias-Variance Analysis [53.38518232934096]
Multi-task learning (MTL) is a powerful machine learning paradigm designed to leverage shared knowledge across tasks to improve generalization and performance.
We propose an MTL approach at the intersection between task clustering and feature transformation based on a two-phase iterative aggregation of targets and features.
In both phases, a key aspect is to preserve the interpretability of the reduced targets and features through the aggregation with the mean, which is motivated by applications to Earth science.
arXiv Detail & Related papers (2024-06-12T08:30:16Z) - GCC: Generative Calibration Clustering [55.44944397168619]
We propose a novel Generative Clustering (GCC) method to incorporate feature learning and augmentation into clustering procedure.
First, we develop a discrimirative feature alignment mechanism to discover intrinsic relationship across real and generated samples.
Second, we design a self-supervised metric learning to generate more reliable cluster assignment.
arXiv Detail & Related papers (2024-04-14T01:51:11Z) - Single-Cell Deep Clustering Method Assisted by Exogenous Gene
Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells.
During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation.
This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z) - Revisiting Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [85.51611950757643]
We propose IAC (Instance-Adaptive Clustering), the first algorithm whose performance matches the instance-specific lower bounds both in expectation and with high probability.<n>IAC maintains an overall computational complexity of $ mathcalO(n, textpolylog(n) $, making it scalable and practical for large-scale problems.
arXiv Detail & Related papers (2023-06-18T08:46:06Z) - Sparse and geometry-aware generalisation of the mutual information for joint discriminative clustering and feature selection [19.066989850964756]
We introduce a discriminative clustering model trying to maximise a geometry-aware generalisation of the mutual information called GEMINI.
This algorithm avoids the burden of feature exploration and is easily scalable to high-dimensional data and large amounts of samples while only designing a discriminative clustering model.
Our results show that Sparse GEMINI is a competitive algorithm and has the ability to select relevant subsets of variables with respect to the clustering without using relevance criteria or prior hypotheses.
arXiv Detail & Related papers (2023-02-07T10:52:04Z) - Simple and Scalable Algorithms for Cluster-Aware Precision Medicine [0.0]
We propose a simple and scalable approach to joint clustering and embedding.
This novel, cluster-aware embedding approach overcomes the complexity and limitations of current joint embedding and clustering methods.
Our approach does not require the user to choose the desired number of clusters, but instead yields interpretable dendrograms of hierarchically clustered embeddings.
arXiv Detail & Related papers (2022-11-29T19:27:26Z) - You Never Cluster Alone [150.94921340034688]
We extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all the data subjected to the same cluster contribute to a unified representation.
We define a set of categorical variables as clustering assignment confidence, which links the instance-level learning track with the cluster-level one.
By reparametrizing the assignment variables, TCC is trained end-to-end, requiring no alternating steps.
arXiv Detail & Related papers (2021-06-03T14:59:59Z) - Inference post Selection of Group-sparse Regression Models [2.1485350418225244]
Conditional inference provides a rigorous approach to counter bias when data from automated model selections is reused for inference.
We develop in this paper a statistically consistent Bayesian framework to assess uncertainties within linear models.
Finding wide applications when genes, proteins, genetic variants, neuroimaging measurements are grouped respectively by their biological pathways, molecular functions, regulatory regions, cognitive roles, these models are selected through a useful class of group-sparse learning algorithms.
arXiv Detail & Related papers (2020-12-31T15:43:26Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z) - Clustering Binary Data by Application of Combinatorial Optimization
Heuristics [52.77024349608834]
We study clustering methods for binary data, first defining aggregation criteria that measure the compactness of clusters.
Five new and original methods are introduced, using neighborhoods and population behavior optimization metaheuristics.
From a set of 16 data tables generated by a quasi-Monte Carlo experiment, a comparison is performed for one of the aggregations using L1 dissimilarity, with hierarchical clustering, and a version of k-means: partitioning around medoids or PAM.
arXiv Detail & Related papers (2020-01-06T23:33:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.