Scalable Graph Attention-based Instance Selection via Mini-Batch Sampling and Hierarchical Hashing
- URL: http://arxiv.org/abs/2502.20293v1
- Date: Thu, 27 Feb 2025 17:17:53 GMT
- Title: Scalable Graph Attention-based Instance Selection via Mini-Batch Sampling and Hierarchical Hashing
- Authors: Zahiriddin Rustamov, Ayham Zaitouny, Nazar Zaki
- Abstract summary: Instance selection (IS) is important in machine learning for reducing dataset size while keeping key characteristics. This paper introduces a graph attention-based instance selection (GAIS) method that uses attention mechanisms to identify informative instances. We present two approaches for scalable graph construction: a distance-based mini-batch sampling technique that reduces computation through strategic batch processing, and a hierarchical hashing approach that allows for efficient similarity computation through random projections.
- Score: 0.24578723416255752
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Instance selection (IS) is important in machine learning for reducing dataset size while keeping key characteristics. Current IS methods often struggle to capture complex relationships in high-dimensional spaces and to scale to large datasets. This paper introduces a graph attention-based instance selection (GAIS) method that uses attention mechanisms to identify informative instances through their structural relationships in graph representations. We present two approaches for scalable graph construction: a distance-based mini-batch sampling technique that reduces computation through strategic batch processing, and a hierarchical hashing approach that allows for efficient similarity computation through random projections. The mini-batch approach keeps class distributions through stratified sampling, while the hierarchical hashing method captures relationships at multiple granularities through single-level, multi-level, and multi-view variants. Experiments across 39 datasets show that GAIS achieves reduction rates above 96% while maintaining or improving model performance relative to state-of-the-art IS methods. The findings show that the distance-based mini-batch approach offers an optimal balance of efficiency and effectiveness for large-scale datasets, while multi-view variants provide superior performance for complex, high-dimensional data. This demonstrates that attention-based importance scoring can effectively identify instances crucial for maintaining decision boundaries without requiring exhaustive pairwise comparisons.
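The two graph-construction ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`stratified_batches`, `hash_buckets`, `candidate_edges`) are hypothetical, sign-based random projection is assumed as one common way to hash by random projections, and only a single hashing level is shown (the multi-level and multi-view variants are not covered).

```python
import numpy as np

def stratified_batches(y, batch_size, rng):
    """Split indices into batches whose class mix approximates that of y."""
    y = np.asarray(y)
    n_batches = max(1, len(y) // batch_size)
    batches = [[] for _ in range(n_batches)]
    for cls in np.unique(y):
        # Shuffle the indices of this class, then spread them over all batches.
        idx = rng.permutation(np.flatnonzero(y == cls))
        for b, chunk in enumerate(np.array_split(idx, n_batches)):
            batches[b].extend(chunk.tolist())
    return [np.array(b, dtype=int) for b in batches]

def hash_buckets(X, n_bits, rng):
    """Group rows of X by the sign pattern of n_bits random projections."""
    planes = rng.standard_normal((X.shape[1], n_bits))
    bits = ((X @ planes) > 0).astype(np.int64)
    # Pack each row's bit pattern into one integer bucket code.
    codes = bits @ (1 << np.arange(n_bits, dtype=np.int64))
    buckets = {}
    for i, code in enumerate(codes):
        buckets.setdefault(int(code), []).append(i)
    return buckets

def candidate_edges(buckets):
    """Connect only instances that share a bucket (no all-pairs comparison)."""
    edges = set()
    for members in buckets.values():
        for a in range(len(members)):
            for b in range(a + 1, len(members)):
                edges.add((members[a], members[b]))
    return edges

# Toy data (assumed for illustration): 200 points, 8 features, 3:1 class ratio.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
y = np.repeat([0, 1], [150, 50])

batches = stratified_batches(y, batch_size=50, rng=rng)
buckets = hash_buckets(X, n_bits=4, rng=rng)
edges = candidate_edges(buckets)
print(len(batches), len(buckets), len(edges))
```

In either scheme the edge set is built from small groups (a batch or a bucket) instead of all n(n-1)/2 pairs, which is the source of the scalability the abstract describes. The graph attention scoring step itself is outside the scope of this sketch.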
Related papers
- GAIS: A Novel Approach to Instance Selection with Graph Attention Networks [1.100197352932064]
This paper introduces a novel method called Graph Attention-based Instance Selection (GAIS) to identify the most informative instances in a dataset. Experiments on 13 diverse datasets demonstrate that GAIS consistently outperforms traditional IS methods in terms of effectiveness. Although GAIS exhibits slightly higher computational costs, its superior performance in maintaining accuracy with significantly reduced training data makes it a promising approach for graph-based data selection.
arXiv Detail & Related papers (2024-12-26T12:51:14Z)
- Fast and Scalable Semi-Supervised Learning for Multi-View Subspace Clustering [13.638434337947302]
FSSMSC is a novel solution to the high computational complexity commonly found in existing approaches.
The method generates a consensus anchor graph across all views, representing each data point as a sparse linear combination of chosen landmarks.
The effectiveness and efficiency of FSSMSC are validated through extensive experiments on multiple benchmark datasets of varying scales.
arXiv Detail & Related papers (2024-08-11T06:54:00Z)
- Data curation via joint example selection further accelerates multimodal learning [3.329535792151987]
We show that jointly selecting batches of data is more effective for learning than selecting examples independently.
We derive a simple and tractable algorithm for selecting such batches, which significantly accelerate training beyond individually-prioritized data points.
arXiv Detail & Related papers (2024-06-25T16:52:37Z)
- One for all: A novel Dual-space Co-training baseline for Large-scale Multi-View Clustering [42.92751228313385]
We propose a novel multi-view clustering model, named Dual-space Co-training Large-scale Multi-view Clustering (DSCMC).
The main objective of our approach is to enhance the clustering performance by leveraging co-training in two distinct spaces.
Our algorithm has an approximate linear computational complexity, which guarantees its successful application on large-scale datasets.
arXiv Detail & Related papers (2024-01-28T16:30:13Z)
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs a local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships.
A second strategy leverages a novel Re-Ranking technique, which has a lower time upper bound complexity and reduces the memory complexity from O(n^2) to O(kn) with k ≪ n.
arXiv Detail & Related papers (2023-07-26T16:19:19Z)
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
- Effective and Efficient Graph Learning for Multi-view Clustering [173.8313827799077]
We propose an effective and efficient graph learning model for multi-view clustering.
Our method exploits the similarity between graphs of different views by minimizing the tensor Schatten p-norm.
Our proposed algorithm is time-economical, obtains stable results, and scales well with the data size.
arXiv Detail & Related papers (2021-08-15T13:14:28Z)
- Spatial-Spectral Clustering with Anchor Graph for Hyperspectral Image [88.60285937702304]
This paper proposes a novel unsupervised approach called spatial-spectral clustering with anchor graph (SSCAG) for HSI data clustering.
The proposed SSCAG is competitive against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-04-24T08:09:27Z)
- Attentional-Biased Stochastic Gradient Descent [74.49926199036481]
We present a provable method (named ABSGD) for addressing the data imbalance or label noise problem in deep learning.
Our method is a simple modification to momentum SGD where we assign an individual importance weight to each sample in the mini-batch.
ABSGD is flexible enough to combine with other robust losses without any additional cost.
arXiv Detail & Related papers (2020-12-13T03:41:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.