Efficient Identification of High Similarity Clusters in Polygon Datasets
- URL: http://arxiv.org/abs/2509.23942v1
- Date: Sun, 28 Sep 2025 15:39:15 GMT
- Title: Efficient Identification of High Similarity Clusters in Polygon Datasets
- Authors: John N. Daras
- Abstract summary: We propose a framework that reduces the number of clusters requiring verification, thereby decreasing the computational load on these systems. The framework integrates dynamic similarity index thresholding, supervised scheduling, and recall-constrained optimization. Our approach achieves substantial reductions in computational cost without sacrificing accuracy.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advancements in tools like Shapely 2.0 and Triton can significantly improve the efficiency of spatial similarity computations by enabling faster and more scalable geometric operations. However, for extremely large datasets, these optimizations may face challenges due to the sheer volume of computations required. To address this, we propose a framework that reduces the number of clusters requiring verification, thereby decreasing the computational load on these systems. The framework integrates dynamic similarity index thresholding, supervised scheduling, and recall-constrained optimization to efficiently identify clusters with the highest spatial similarity while meeting user-defined precision and recall requirements. By leveraging Kernel Density Estimation (KDE) to dynamically determine similarity thresholds and machine learning models to prioritize clusters, our approach achieves substantial reductions in computational cost without sacrificing accuracy. Experimental results demonstrate the scalability and effectiveness of the method, offering a practical solution for large-scale geospatial analysis.
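As a rough illustration of the pipeline the abstract describes, the sketch below pairs a Shapely-based polygon similarity (IoU, one plausible choice of similarity index) with a KDE-derived cutoff. The quantile rule, function names, and cluster representation are all assumptions made for illustration; the paper's actual thresholding, supervised scheduling, and recall-constrained optimization are more involved.

```python
# Illustrative sketch only: KDE-driven similarity thresholding for polygon
# clusters. The similarity index (IoU), the quantile rule, and all names
# here are assumptions; the paper's full framework additionally uses
# supervised scheduling and recall-constrained optimization.
import numpy as np
from scipy.stats import gaussian_kde
from shapely.geometry import Polygon

def iou(a: Polygon, b: Polygon) -> float:
    """Intersection-over-union similarity of two polygons."""
    union = a.union(b).area
    return a.intersection(b).area / union if union > 0 else 0.0

def kde_threshold(scores: np.ndarray, quantile: float = 0.9) -> float:
    """Derive a cutoff from the KDE-smoothed distribution of scores."""
    kde = gaussian_kde(scores)
    grid = np.linspace(scores.min(), scores.max(), 512)
    cdf = np.cumsum(kde(grid))
    cdf /= cdf[-1]  # normalize into an empirical CDF over the grid
    return float(grid[np.searchsorted(cdf, quantile)])

def high_similarity_clusters(clusters: list[list[Polygon]],
                             quantile: float = 0.9) -> list[list[Polygon]]:
    """Keep clusters whose mean pairwise IoU clears the KDE-derived cutoff."""
    means = []
    for polys in clusters:
        pairs = [iou(polys[i], polys[j])
                 for i in range(len(polys)) for j in range(i + 1, len(polys))]
        means.append(np.mean(pairs) if pairs else 0.0)
    cutoff = kde_threshold(np.asarray(means), quantile)
    return [c for c, m in zip(clusters, means) if m >= cutoff]
```

In the paper's framework the threshold is chosen dynamically to meet user-defined precision and recall requirements; the fixed quantile above is only a stand-in for that mechanism.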
Related papers
- Sparse Convex Biclustering [3.067019303674385]
Biclustering is an essential machine learning technique for simultaneously clustering the rows and columns of a data matrix. We propose a novel method that penalizes noise during the biclustering process to improve both accuracy and stability.
arXiv Detail & Related papers (2026-01-05T03:15:52Z)
- Data Skeleton Learning: Scalable Active Clustering with Sparse Graph Structures [14.417696261026492]
We propose a graph-based active clustering algorithm that utilizes two sparse graphs. These two graphs work in concert, enabling the refinement of connected subgraphs within the data skeleton to create nested clusters. Our empirical analysis confirms that the proposed algorithm consistently yields more accurate clustering with dramatically fewer user-provided constraints.
arXiv Detail & Related papers (2025-09-10T12:18:52Z)
- CAS Condensed and Accelerated Silhouette: An Efficient Method for Determining the Optimal K in K-Means Clustering [0.0]
This paper presents strategies for selecting the optimal value of k in clustering. It focuses on achieving a balance between clustering precision and computational efficiency in complex data environments. The proposed approach achieves up to 99 percent faster execution times on high-dimensional datasets. (A sketch of the baseline silhouette sweep appears after this list.)
arXiv Detail & Related papers (2025-07-11T05:03:16Z)
- iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use [56.31110409360567]
Augmenting large language models with external tools is a promising approach to enhance their capabilities. We show that training gains significantly decay as synthetic data increases. We propose an iterative reinforced fine-tuning strategy designed to alleviate this limitation.
arXiv Detail & Related papers (2025-01-15T04:52:34Z)
- Scalable Co-Clustering for Large-Scale Data through Dynamic Partitioning and Hierarchical Merging [7.106620444966807]
Co-clustering simultaneously clusters rows and columns, revealing more fine-grained groups. Existing co-clustering methods suffer from poor scalability and cannot handle large-scale data. This paper presents a novel and scalable co-clustering method designed to uncover intricate patterns in high-dimensional, large-scale datasets.
arXiv Detail & Related papers (2024-10-09T04:47:22Z)
- Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation. We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations. We build HRVMamba, a novel model for efficient high-resolution representation learning.
arXiv Detail & Related papers (2024-10-04T06:19:29Z)
- Center-Sensitive Kernel Optimization for Efficient On-Device Incremental Learning [88.78080749909665]
Current on-device training methods focus on efficient training without considering catastrophic forgetting. This paper proposes a simple but effective edge-friendly incremental learning framework. Our method achieves an average accuracy boost of 38.08% with even less memory and approximate computation.
arXiv Detail & Related papers (2024-06-13T05:49:29Z)
- Sample-Efficient "Clustering and Conquer" Procedures for Parallel Large-Scale Ranking and Selection [0.0]
We modify the commonly used "divide and conquer" framework in parallel computing by adding a correlation-based clustering step. This seemingly simple modification achieves the optimal sample complexity reduction for a widely used class of efficient large-scale R&S procedures. In large-scale AI applications such as neural architecture search, our methods demonstrate superior performance.
arXiv Detail & Related papers (2024-02-03T15:56:03Z)
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions. (A sketch of the classic unweighted k-center baseline appears after this list.)
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- Randomized Dimension Reduction with Statistical Guarantees [0.27195102129095]
This thesis explores several such algorithms for fast execution and efficient data utilization.
We focus on learning algorithms that incorporate data augmentation in various ways to provably improve generalization and distributional robustness.
Specifically, Chapter 4 presents a sample complexity analysis for data augmentation consistency regularization.
arXiv Detail & Related papers (2023-10-03T02:01:39Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- Fast Distributionally Robust Learning with Variance Reduced Min-Max Optimization [85.84019017587477]
Distributionally robust supervised learning is emerging as a key paradigm for building reliable machine learning systems for real-world applications.
Existing algorithms for solving Wasserstein DRSL involve solving complex subproblems or fail to make use of stochastic gradients.
We revisit Wasserstein DRSL through the lens of min-max optimization and derive scalable and efficiently implementable extra-gradient algorithms. (A toy extra-gradient sketch appears after this list.)
arXiv Detail & Related papers (2021-04-27T16:56:09Z)
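For the CAS entry above: CAS condenses and accelerates the classic silhouette criterion for choosing k. As background only, here is a minimal sketch of the standard (unaccelerated) silhouette sweep it improves on, using scikit-learn; this is the textbook baseline, not the CAS method itself.

```python
# Standard silhouette sweep for choosing k in k-means: the baseline that
# CAS condenses and accelerates, not the CAS method itself.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(X: np.ndarray, k_range=range(2, 11), seed: int = 0) -> int:
    """Return the k whose k-means labeling maximizes the mean silhouette."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)  # in [-1, 1]; higher is better
    return max(scores, key=scores.get)
```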
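For the weighted k-center entry above: the classic greedy farthest-first traversal is the textbook 2-approximation for unweighted k-center, which gives a feel for what the paper's factor-3 algorithm (adding an uncertainty-sampling term) builds on. A minimal sketch of the unweighted baseline only:

```python
# Greedy farthest-first k-center: the textbook 2-approximation for the
# unweighted problem. The paper's weighted variant is not shown here.
import numpy as np

def greedy_k_center(X: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Pick k indices approximately minimizing the max distance to a center."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]             # arbitrary first center
    dist = np.linalg.norm(X - X[centers[0]], axis=1)  # distance to nearest center
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))                    # farthest point so far
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return centers
```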
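For the last entry: the extra-gradient idea is easiest to see on a toy bilinear saddle-point problem min_x max_y x^T A y, where plain simultaneous gradient descent-ascent spirals away but the extrapolation step converges for a sufficiently small step size. This sketch shows only the generic method, not the paper's variance-reduced DRSL algorithms.

```python
# Extra-gradient on a toy bilinear saddle point: min_x max_y x.T @ A @ y.
# Illustrates the generic extra-gradient method only, not the paper's
# variance-reduced DRSL-specific algorithms.
import numpy as np

def extragradient(A: np.ndarray, steps: int = 2000, lr: float = 0.05):
    x = np.ones(A.shape[0])
    y = np.ones(A.shape[1])
    for _ in range(steps):
        # 1) extrapolate to a midpoint using gradients at the current point
        x_half = x - lr * (A @ y)        # descent step in x (grad_x = A y)
        y_half = y + lr * (A.T @ x)      # ascent step in y (grad_y = A^T x)
        # 2) update the original point using the midpoint's gradients
        x = x - lr * (A @ y_half)
        y = y + lr * (A.T @ x_half)
    return x, y  # converges toward the saddle point (0, 0)
```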
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.