Block size estimation for data partitioning in HPC applications using
machine learning techniques
- URL: http://arxiv.org/abs/2211.10819v2
- Date: Wed, 31 Jan 2024 22:02:20 GMT
- Authors: Riccardo Cantini, Fabrizio Marozzo, Alessio Orsino, Domenico Talia,
Paolo Trunfio, Rosa M. Badia, Jorge Ejarque, Fernando Vazquez
- Abstract summary: This paper describes a methodology, namely BLEST-ML (BLock size ESTimation through Machine Learning), for block size estimation.
The proposed methodology was evaluated by designing an implementation tailored to dislib, a distributed computing library.
The results we obtained show the ability of BLEST-ML to efficiently determine a suitable way to split a given dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The extensive use of HPC infrastructures and frameworks for running
data-intensive applications has led to a growing interest in data partitioning
techniques and strategies. In fact, application performance can be heavily
affected by how data are partitioned, which in turn depends on the selected
size for data blocks, i.e. the block size. Therefore, finding an effective
partitioning, i.e. a suitable block size, is a key strategy to speed-up
parallel data-intensive applications and increase scalability. This paper
describes a methodology, namely BLEST-ML (BLock size ESTimation through Machine
Learning), for block size estimation that relies on supervised machine learning
techniques. The proposed methodology was evaluated by designing an
implementation tailored to dislib, a distributed computing library highly
focused on machine learning algorithms built on top of the PyCOMPSs framework.
We assessed the effectiveness of the provided implementation through an
extensive experimental evaluation considering different algorithms from dislib,
datasets, and infrastructures, including the MareNostrum 4 supercomputer. The
results we obtained show the ability of BLEST-ML to efficiently determine a
suitable way to split a given dataset, thus providing a proof of its
applicability to enable the efficient execution of data-parallel applications
in high performance environments.
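The idea of learning a block size from execution features can be illustrated with a minimal sketch. This is not BLEST-ML itself: the features (dataset rows and columns, available cores), the training samples, and the nearest-neighbor regressor are all illustrative stand-ins for the supervised model the paper trains.

```python
# Hypothetical sketch: predict a block size (in rows) from execution features
# using a simple k-nearest-neighbor regressor. Features and training data are
# illustrative, not taken from BLEST-ML.

import math

# Each sample: (log10 of dataset rows, n_cols, available cores) -> block size.
TRAINING = [
    ((4.0, 10, 16), 2_000),
    ((5.0, 10, 16), 10_000),
    ((6.0, 10, 48), 50_000),
    ((7.0, 100, 48), 200_000),
    ((7.0, 100, 96), 100_000),
]

def estimate_block_size(n_rows, n_cols, cores, k=2):
    """Average the block sizes of the k closest training configurations."""
    query = (math.log10(n_rows), n_cols, cores)
    by_distance = sorted(
        (math.dist(query, feats), size) for feats, size in TRAINING
    )
    nearest = by_distance[:k]
    return sum(size for _, size in nearest) / k

print(estimate_block_size(10_000_000, 100, 96))  # prints 150000.0
```

A production model would use a richer feature set (algorithm identity, memory per node, network characteristics) and a trained regressor rather than a hand-built lookup, but the input/output contract is the same: execution context in, block size out.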
Related papers
- Efficient $k$-NN Search in IoT Data: Overlap Optimization in Tree-Based Indexing Structures [0.6990493129893112]
The proliferation of interconnected devices in the Internet of Things (IoT) has led to an exponential increase in data.
Efficient retrieval of this heterogeneous data demands a robust indexing mechanism for effective organization.
We propose three innovative techniques designed to quantify and strategically reduce data space partition overlap.
arXiv Detail & Related papers (2024-08-28T16:16:55Z)
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
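As background for k-center-based subset selection, a minimal sketch of the classic greedy farthest-first heuristic (a 2-approximation for k-center) is shown below. The weighted 3-approximation in the paper extends this idea; the points and k used here are illustrative.

```python
# Background sketch: greedy farthest-first traversal for k-center, the
# building block that k-center-based subset selection methods extend.

import math

def greedy_k_center(points, k):
    """Pick k centers by repeatedly choosing the point farthest from the
    current center set."""
    centers = [points[0]]  # arbitrary first center
    while len(centers) < k:
        farthest = max(
            points,
            key=lambda p: min(math.dist(p, c) for c in centers),
        )
        centers.append(farthest)
    return centers

pts = [(0, 0), (1, 0), (10, 0), (10, 1), (5, 5)]
print(greedy_k_center(pts, 3))  # prints [(0, 0), (10, 1), (5, 5)]
```

Each iteration costs O(nk) distance evaluations, so selecting a subset of size k from n points runs in O(nk^2) time, which is what makes greedy k-center attractive for large training sets.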
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- Learned spatial data partitioning [7.342228103959199]
We first study learned spatial data partitioning, which effectively assigns groups of big spatial data to computers based on locations of data.
We formalize spatial data partitioning in the context of reinforcement learning and develop a novel deep reinforcement learning algorithm.
Our method efficiently finds partitions for accelerating distance join queries and reduces the workload run time by up to 59.4%.
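The baseline such a learned policy improves on can be sketched as a fixed uniform grid that assigns spatial records to partitions by location. This is an illustrative sketch, not the paper's reinforcement-learning method; the function name and parameters are hypothetical.

```python
# Illustrative sketch: a fixed uniform grid that maps spatial records to
# partition keys by location. A learned partitioner would adapt the cell
# boundaries to the data and query distribution instead.

def grid_partition(points, n_cells_x, n_cells_y, bounds):
    """Map each (x, y) point to a grid-cell id usable as a partition key."""
    min_x, min_y, max_x, max_y = bounds
    parts = {}
    for x, y in points:
        cx = min(int((x - min_x) / (max_x - min_x) * n_cells_x), n_cells_x - 1)
        cy = min(int((y - min_y) / (max_y - min_y) * n_cells_y), n_cells_y - 1)
        parts.setdefault(cy * n_cells_x + cx, []).append((x, y))
    return parts

parts = grid_partition([(0.1, 0.1), (0.9, 0.9), (0.2, 0.15)], 2, 2, (0, 0, 1, 1))
```

A uniform grid ignores skew: dense regions overload one worker while sparse cells sit idle, which is exactly the imbalance a learned partitioner is trained to avoid.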
arXiv Detail & Related papers (2023-06-08T00:42:10Z)
- Scalable Batch Acquisition for Deep Bayesian Active Learning [70.68403899432198]
In deep active learning, it is important to choose multiple examples to mark up at each step.
Existing solutions to this problem, such as BatchBALD, have significant limitations in selecting a large number of examples.
We present the Large BatchBALD algorithm, which aims to achieve comparable quality while being more computationally efficient.
arXiv Detail & Related papers (2023-01-13T11:45:17Z)
- Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT) based intelligent and ubiquitous computing.
For fast-increasing applications and data amounts, distributed learning is a promising emerging paradigm since it is often impractical or inefficient to share/aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z)
- Compactness Score: A Fast Filter Method for Unsupervised Feature Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
Our proposed algorithm seems to be more accurate and efficient compared with existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z)
- LSEC: Large-scale spectral ensemble clustering [8.545202841051582]
We propose a large-scale spectral ensemble clustering (LSEC) method to strike a good balance between efficiency and effectiveness.
The LSEC method achieves a lower computational complexity than most existing ensemble clustering methods.
arXiv Detail & Related papers (2021-06-18T00:42:03Z)
- An Accurate and Efficient Large-scale Regression Method through Best Friend Clustering [10.273838113763192]
We propose a novel and simple data structure capturing the most important information among data samples.
We combine the clustering with regression techniques as a parallel library and utilize a hybrid structure of data and model parallelism to make predictions.
arXiv Detail & Related papers (2021-04-22T01:34:29Z)
- Structured Inverted-File k-Means Clustering for High-Dimensional Sparse Data [2.487445341407889]
This paper presents an architecture-friendly k-means clustering algorithm called SIVF for a large-scale and high-dimensional sparse data set.
Our performance analysis reveals that SIVF achieves higher speed by reducing the number of cache misses and branch mispredictions.
arXiv Detail & Related papers (2021-03-30T07:54:02Z)
- A Survey on Large-scale Machine Learning [67.6997613600942]
Machine learning can provide deep insights into data, allowing machines to make high-quality predictions.
Most sophisticated machine learning approaches suffer from huge time costs when operating on large-scale data.
Large-scale Machine Learning aims to efficiently learn patterns from big data with comparable performance.
arXiv Detail & Related papers (2020-08-10T06:07:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.