Related papers: CLiMB: A Domain-Informed Novelty Detection Clustering Framework for Scientific Discovery

URL: http://arxiv.org/abs/2601.09768v1
Date: Wed, 14 Jan 2026 11:21:05 GMT
Title: CLiMB: A Domain-Informed Novelty Detection Clustering Framework for Scientific Discovery
Authors: Lorenzo Monti, Tatiana Muraveva, Brian Sheridan, Davide Massari, Alessia Garofalo, Gisella Clementini, Umberto Michelucci
Abstract summary: CLiMB is a domain-informed framework for semi-supervised clustering and novelty detection. It decouples the exploitation of prior knowledge from the exploration of unknown structures. CLiMB attains an Adjusted Rand Index of 0.829 with 90% seed coverage in recovering known Milky Way substructures.
Score: 1.0554048699217669
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In data-driven scientific discovery, a challenge lies in classifying well-characterized phenomena while identifying novel anomalies. Current semi-supervised clustering algorithms do not always fully address this duality, often assuming that supervisory signals are globally representative. Consequently, methods often enforce rigid constraints that suppress unanticipated patterns or require a pre-specified number of clusters, rendering them ineffective for genuine novelty detection. To bridge this gap, we introduce CLiMB (CLustering in Multiphase Boundaries), a domain-informed framework decoupling the exploitation of prior knowledge from the exploration of unknown structures. Using a sequential two-phase approach, CLiMB first anchors known clusters using constrained partitioning, and subsequently applies density-based clustering to residual data to reveal arbitrary topologies. We demonstrate this framework on RR Lyrae stars data from the Gaia Data Release 3. CLiMB attains an Adjusted Rand Index of 0.829 with 90% seed coverage in recovering known Milky Way substructures, drastically outperforming heuristic and constraint-based baselines, which stagnate below 0.20. Furthermore, sensitivity analysis confirms CLiMB's superior data efficiency, showing monotonic improvement as knowledge increases. Finally, the framework successfully isolates three dynamical features (Shiva, Shakti, and the Galactic Disk) in the unlabelled field, validating its potential for scientific discovery.
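The sequential two-phase idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the constrained partitioning step, so the sketch substitutes a simple nearest-seed-centroid assignment with a distance cutoff, and uses a small pure-Python DBSCAN for the density-based phase. The function names (`anchor_known`, `dbscan`, `two_phase`), the parameters (`radius`, `eps`, `min_pts`), and the toy data are all hypothetical.

```python
import math
from collections import deque


def anchor_known(points, seeds, radius):
    """Phase 1 stand-in (assumption): assign each point to the nearest
    seed centroid if it lies within `radius`; otherwise mark it residual (None)."""
    centroids = {}
    for label, pts in seeds.items():
        dim = len(pts[0])
        centroids[label] = tuple(sum(p[d] for p in pts) / len(pts) for d in range(dim))
    labels = []
    for p in points:
        best = min(centroids, key=lambda k: math.dist(p, centroids[k]))
        labels.append(best if math.dist(p, centroids[best]) <= radius else None)
    return labels


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster id per point, -1 for noise."""
    n = len(points)
    labels = [None] * n

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(nb)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbj = neighbors(j)
            if len(nbj) >= min_pts:  # only core points expand the cluster
                queue.extend(nbj)
    return labels


def two_phase(points, seeds, radius, eps, min_pts):
    """Anchor known clusters first, then run density clustering on the residual."""
    labels = anchor_known(points, seeds, radius)
    residual_idx = [i for i, lab in enumerate(labels) if lab is None]
    residual = [points[i] for i in residual_idx]
    for i, c in zip(residual_idx, dbscan(residual, eps, min_pts)):
        labels[i] = "noise" if c == -1 else f"new_{c}"
    return labels


# Toy data: two known groups (A, B), one unlabelled dense group, one outlier.
points = [
    (0.0, 0.0), (0.2, 0.1), (-0.1, 0.3),
    (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
    (10.0, 0.0), (10.2, 0.1), (9.9, -0.2), (10.1, 0.2),
    (20.0, 20.0),
]
seeds = {"A": [(0.0, 0.0), (0.2, 0.1)], "B": [(5.0, 5.0)]}

labels = two_phase(points, seeds, radius=1.0, eps=0.5, min_pts=3)
print(labels)
```

On this toy input the seeded groups keep their known labels, the dense unlabelled group is reported as a new cluster, and the isolated point is left as noise. Because the second phase only sees points the first phase rejected, the number of new clusters does not have to be specified in advance.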

Related papers

Latent Sculpting for Zero-Shot Generalization: A Manifold Learning Approach to Out-of-Distribution Anomaly Detection [2.8547732086436306]
A fundamental limitation of supervised deep learning is "Generalization Collapse". We propose Latent Sculpting, a hierarchical two-stage representation learning framework. We report an 88.89% detection rate on "Infiltration" scenarios.
arXiv Detail & Related papers (2025-12-19T11:37:02Z)
Hyperbolic Gaussian Blurring Mean Shift: A Statistical Mode-Seeking Framework for Clustering in Curved Spaces [15.555757275390846]
Clustering is a fundamental unsupervised learning task for uncovering patterns in data. In this work, we introduce HypeGBMS, a novel extension of GBMS to hyperbolic space. Our method replaces Euclidean computations with hyperbolic distances and employs Möbius-weighted means to ensure that all updates remain consistent with the geometry of the space.
arXiv Detail & Related papers (2025-12-12T10:40:26Z)
Reliable data clustering with Bayesian community detection [0.0]
Researchers rely on clustering similarity data to uncover modular structure. Yet widely used clustering methods, such as hierarchical clustering, k-means, and WGCNA, lack principled model selection, leaving them susceptible to noise. A common workaround sparsifies a correlation matrix representation to remove noise before clustering, but this extra step introduces arbitrary thresholds that can distort the structure and lead to unreliable results.
arXiv Detail & Related papers (2025-10-16T14:10:24Z)
Calibrating Biased Distribution in VFM-derived Latent Space via Cross-Domain Geometric Consistency [52.52950138164424]
We show that when leveraging off-the-shelf (vision) foundation models for feature extraction, the geometric shapes of the resulting feature distributions exhibit remarkable transferability across domains and datasets. We embody our geometric knowledge-guided distribution calibration framework in two popular and challenging settings: federated learning and long-tailed recognition. In long-tailed learning, it utilizes the geometric knowledge transferred from sample-rich categories to recover the true distribution for sample-scarce tail classes.
arXiv Detail & Related papers (2025-08-19T05:22:59Z)
CLIP Meets Diffusion: A Synergistic Approach to Anomaly Detection [49.11819337853632]
Anomaly detection is a complex problem due to the ambiguity in defining anomalies, the diversity of anomaly types, and the scarcity of training data. We propose CLIPfusion, a method that leverages both discriminative and generative foundation models. We believe that our method underscores the effectiveness of multi-modal and multi-model fusion in tackling the multifaceted challenges of anomaly detection.
arXiv Detail & Related papers (2025-06-13T13:30:15Z)
Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning [53.527506374566485]
We propose a novel Adaptive and Robust DBSCAN with Multi-agent Reinforcement Learning cluster framework, namely AR-DBSCAN. We show that AR-DBSCAN not only improves clustering accuracy by up to 144.1% and 175.3% in the NMI and ARI metrics, respectively, but is also capable of robustly finding dominant parameters.
arXiv Detail & Related papers (2025-05-07T11:37:23Z)
Crane: Context-Guided Prompt Learning and Attention Refinement for Zero-Shot Anomaly Detection [50.343419243749054]
Anomaly detection is critical in fields such as medical diagnostics and industrial defect detection. CLIP's coarse-grained image-text alignment limits localization and detection performance for fine-grained anomalies. Crane improves the state-of-the-art ZSAD from 2% to 28%, at both image and pixel levels, while remaining competitive in inference speed.
arXiv Detail & Related papers (2025-04-15T10:42:25Z)
Cluster Quilting: Spectral Clustering for Patchwork Learning [8.500141848121782]
We focus on the clustering problem in patchwork learning, aiming at discovering clusters amongst all samples even when some are never jointly observed for any feature. We propose a novel spectral clustering method called Cluster Quilting, consisting of (i) patch ordering that exploits the overlapping structure amongst all patches, (ii) patchwise SVD, (iii) sequential linear mapping of top singular vectors for patch overlaps, followed by (iv) k-means on the combined and weighted singular vectors. Under a sub-Gaussian mixture model, we establish theoretical guarantees via a non-asymptotic misclustering rate bound that reflects both
arXiv Detail & Related papers (2024-06-19T20:52:47Z)
GCC: Generative Calibration Clustering [55.44944397168619]
We propose a novel Generative Calibration Clustering (GCC) method to incorporate feature learning and augmentation into the clustering procedure. First, we develop a discriminative feature alignment mechanism to discover the intrinsic relationship across real and generated samples. Second, we design a self-supervised metric learning scheme to generate more reliable cluster assignments.
arXiv Detail & Related papers (2024-04-14T01:51:11Z)
Learning for Transductive Threshold Calibration in Open-World Recognition [83.35320675679122]
We introduce OpenGCN, a Graph Neural Network-based transductive threshold calibration method with enhanced robustness and adaptability. Experiments across open-world visual recognition benchmarks validate OpenGCN's superiority over existing posthoc calibration methods for open-world threshold calibration.
arXiv Detail & Related papers (2023-05-19T23:52:48Z)
Toward Certified Robustness Against Real-World Distribution Shifts [65.66374339500025]
We train a generative model to learn perturbations from data and define specifications with respect to the output of the learned model. A unique challenge arising from this setting is that existing verifiers cannot tightly approximate sigmoid activations. We propose a general meta-algorithm for handling sigmoid activations which leverages classical notions of counter-example-guided abstraction refinement.
arXiv Detail & Related papers (2022-06-08T04:09:13Z)
CycleCluster: Modernising Clustering Regularisation for Deep Semi-Supervised Classification [0.0]
We propose a novel framework, CycleCluster, for deep semi-supervised classification. Our core optimisation is driven by a new clustering-based regularisation along with graph-based pseudo-labels and a shared deep network.
arXiv Detail & Related papers (2020-01-15T13:34:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.