A Review and Analysis of a Parallel Approach for Decision Tree Learning from Large Data Streams
- URL: http://arxiv.org/abs/2505.11780v1
- Date: Sat, 17 May 2025 01:07:25 GMT
- Title: A Review and Analysis of a Parallel Approach for Decision Tree Learning from Large Data Streams
- Authors: Zeinab Shiralizadeh
- Abstract summary: This work studies one of the parallel decision tree learning algorithms, pdsCART, designed for scalable and efficient data analysis. It supports real-time learning from data streams, allowing trees to be constructed incrementally, and enables parallel processing of high-volume streaming data, making it well-suited for large-scale applications.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work studies one of the parallel decision tree learning algorithms, pdsCART, designed for scalable and efficient data analysis. The method incorporates three core capabilities. First, it supports real-time learning from data streams, allowing trees to be constructed incrementally. Second, it enables parallel processing of high-volume streaming data, making it well-suited for large-scale applications. Third, the algorithm integrates seamlessly into the MapReduce framework, ensuring compatibility with distributed computing environments. In what follows, we present the algorithm's key components along with results highlighting its performance and scalability.
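The abstract does not spell out pdsCART's internals, but the combination it describes (incremental tree construction over partitioned stream data, aggregated through MapReduce) follows a recognizable pattern. Below is a minimal Python sketch of that map/reduce split-statistics pattern; the function names (map_partition, reduce_stats, best_gini_split) and the use of the Gini criterion are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch (not the authors' implementation) of the map/reduce
# split-statistics pattern the abstract describes: mappers summarize their
# shard of the stream into per-split class counts, a reducer merges them,
# and the best split is chosen by a CART-style Gini criterion (assumed here).
from collections import Counter, defaultdict

def map_partition(records, attribute, thresholds):
    """Map step: class counts on each side of every candidate split, for one shard."""
    stats = {t: (Counter(), Counter()) for t in thresholds}
    for x, label in records:                      # records = iterable of (features, label)
        for t, (left, right) in stats.items():
            (left if x[attribute] <= t else right)[label] += 1
    return stats

def reduce_stats(partition_stats):
    """Reduce step: merge the per-shard counters into global split statistics."""
    merged = defaultdict(lambda: (Counter(), Counter()))
    for stats in partition_stats:
        for t, (left, right) in stats.items():
            merged[t][0].update(left)
            merged[t][1].update(right)
    return merged

def gini(counts):
    """Gini impurity of a class-count distribution."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_gini_split(merged):
    """Choose the threshold minimizing the weighted Gini impurity of the two children."""
    def weighted(left, right):
        n = sum(left.values()) + sum(right.values())
        return (sum(left.values()) * gini(left) + sum(right.values()) * gini(right)) / n
    return min(merged.items(), key=lambda kv: weighted(*kv[1]))[0]
```

Under this sketch, each mapper would run map_partition over the records it receives from the stream, and the reducer would apply reduce_stats and best_gini_split to grow the current leaf, which matches the incremental, partition-parallel behaviour the abstract attributes to pdsCART.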
Related papers
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
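As background for this entry, the classic greedy farthest-first heuristic (a 2-approximation for the plain k-center objective) is the usual building block for such subset-selection methods. The sketch below shows only that baseline; the paper's weighted 3-approximation, which mixes the k-center and uncertainty-sampling objectives, is not reproduced, and all names are illustrative.

```python
# Background only: standard greedy farthest-first heuristic for k-center
# subset selection (a 2-approximation for the plain k-center objective),
# not the paper's weighted 3-approximation.
import numpy as np

def greedy_k_center(points: np.ndarray, k: int, seed: int = 0) -> list:
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(points)))]                   # random first center
    dist = np.linalg.norm(points - points[centers[0]], axis=1)   # distance to nearest chosen center
    for _ in range(k - 1):
        nxt = int(dist.argmax())                                 # farthest remaining point
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return centers
```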
arXiv Detail & Related papers (2023-12-17T04:41:07Z) - Relation-aware Ensemble Learning for Knowledge Graph Embedding [68.94900786314666]
We propose to learn an ensemble by leveraging existing methods in a relation-aware manner.
However, exploring these relation semantics with a relation-aware ensemble leads to a much larger search space than for general ensemble methods.
We propose a divide-search-combine algorithm RelEns-DSC that searches the relation-wise ensemble weights independently.
arXiv Detail & Related papers (2023-10-13T07:40:12Z) - Parallel Tree Kernel Computation [0.0]
We devise a parallel implementation of the sequential algorithm for computing certain tree kernels between two finite sets of trees.
Results show that the proposed parallel algorithm outperforms the sequential one in terms of latency.
arXiv Detail & Related papers (2023-05-12T18:16:45Z) - Block size estimation for data partitioning in HPC applications using machine learning techniques [38.063905789566746]
This paper describes a methodology, namely BLEST-ML (BLock size ESTimation through Machine Learning), for block size estimation.
The proposed methodology was evaluated by designing an implementation tailored to dislib, a distributed computing library.
The results we obtained show the ability of BLEST-ML to efficiently determine a suitable way to split a given dataset.
arXiv Detail & Related papers (2022-11-19T23:04:14Z) - Stabilizing Q-learning with Linear Architectures for Provably Efficient Learning [53.17258888552998]
This work proposes an exploration variant of the basic $Q$-learning protocol with linear function approximation.
We show that the performance of the algorithm degrades very gracefully under a novel and more permissive notion of approximation error.
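As a reference point for this entry, the textbook Q-learning update with linear function approximation, Q(s, a) = w · φ(s, a), is sketched below; the paper's exploration variant and its error analysis are not shown, and phi, alpha, and gamma are illustrative assumptions.

```python
# Textbook Q-learning update with linear function approximation,
# Q(s, a) = w . phi(s, a). The paper's exploration scheme is not shown;
# phi, alpha, and gamma below are illustrative assumptions.
import numpy as np

def q_learning_step(w, phi, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    q_sa = w @ phi(s, a)
    q_next = max(w @ phi(s_next, b) for b in actions)   # greedy bootstrap target
    td_error = r + gamma * q_next - q_sa
    return w + alpha * td_error * phi(s, a)             # semi-gradient weight update
```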
arXiv Detail & Related papers (2022-06-01T23:26:51Z) - An Improved Reinforcement Learning Algorithm for Learning to Branch [12.27934038849211]
Branch-and-bound (B&B) is a general and widely used method for optimization.
In this paper, we propose a novel reinforcement learning-based B&B algorithm.
We evaluate the performance of the proposed algorithm over three public research benchmarks.
arXiv Detail & Related papers (2022-01-17T04:50:11Z) - Towards Large Scale Automated Algorithm Design by Integrating Modular Benchmarking Frameworks [0.9281671380673306]
We present a first proof-of-concept use-case that demonstrates the efficiency of the algorithm framework ParadisEO with the automated algorithm configuration tool irace and the experimental platform IOHprofiler.
Key advantages of our pipeline are fast evaluation times, the possibility to generate rich data sets, and a standardized interface that can be used to benchmark very broad classes of sampling-based optimization heuristics.
arXiv Detail & Related papers (2021-02-12T10:47:00Z) - Triclustering in Big Data Setting [2.752817022620644]
We describe versions of triclustering algorithms adapted for efficient computation in distributed environments using the MapReduce model or the parallelisation mechanisms provided by modern programming languages.
The OAC family of triclustering algorithms shows good parallelisation capabilities due to the independent processing of triples of a triadic formal context.
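The "independent processing of triples" property mentioned above is what makes the OAC family easy to parallelise: each (object, attribute, condition) triple of the triadic context can be mapped to a candidate tricluster on its own. A rough Python sketch of that pattern follows; the closure operators are simplified stand-ins, not the exact OAC definitions, and the multiprocessing setup is illustrative.

```python
# Sketch of the "independent processing of triples" pattern behind the OAC
# family's parallelisability. The closure operators below are simplified
# stand-ins, not the exact OAC tricluster definitions.
from multiprocessing import Pool

def tricluster_of(triple, context):
    g, m, b = triple
    objs  = frozenset(g2 for (g2, m2, b2) in context if m2 == m and b2 == b)
    attrs = frozenset(m2 for (g2, m2, b2) in context if g2 == g and b2 == b)
    conds = frozenset(b2 for (g2, m2, b2) in context if g2 == g and m2 == m)
    return objs, attrs, conds

def parallel_triclusters(context, workers=4):
    # Each triple yields its candidate tricluster independently, so the map
    # parallelises trivially (here with a process pool; MapReduce works too).
    with Pool(workers) as pool:
        return set(pool.starmap(tricluster_of, [(t, context) for t in context]))
```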
arXiv Detail & Related papers (2020-10-24T16:55:55Z) - Towards Efficient and Scalable Acceleration of Online Decision Tree Learning on FPGA [20.487660974785943]
In the era of big data, traditional decision tree induction algorithms are not suitable for learning from large-scale datasets.
We introduce a new quantile-based algorithm to improve the induction of the Hoeffding tree, one of the state-of-the-art online learning models.
We present a high-performance, hardware-efficient and scalable online decision tree learning system on a field-programmable gate array.
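For reference, the Hoeffding-bound split test at the heart of online (Hoeffding) decision tree induction is sketched below; the paper's quantile-based candidate generation and the FPGA architecture itself are not shown, and the default parameters are illustrative.

```python
# The Hoeffding-bound split test used by online (Hoeffding) decision trees:
# split once the gap between the two best candidate gains exceeds the bound.
# value_range and delta defaults here are illustrative assumptions.
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain: float, second_gain: float, n: int,
                 value_range: float = 1.0, delta: float = 1e-7) -> bool:
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```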
arXiv Detail & Related papers (2020-09-03T03:23:43Z) - Unsupervised Deep Cross-modality Spectral Hashing [65.3842441716661]
The framework is a two-step hashing approach which decouples the optimization into binary optimization and hashing function learning.
We propose a novel spectral embedding-based algorithm to simultaneously learn single-modality and binary cross-modality representations.
We leverage powerful CNNs for images and propose a CNN-based deep architecture to learn the text modality.
arXiv Detail & Related papers (2020-08-01T09:20:11Z) - MurTree: Optimal Classification Trees via Dynamic Programming and Search [61.817059565926336]
We present a novel algorithm for learning optimal classification trees based on dynamic programming and search.
Our approach uses only a fraction of the time required by the state-of-the-art and can handle datasets with tens of thousands of instances.
arXiv Detail & Related papers (2020-07-24T17:06:55Z) - Image Matching across Wide Baselines: From Paper to Practice [80.9424750998559]
We introduce a comprehensive benchmark for local features and robust estimation algorithms.
Our pipeline's modular structure allows easy integration, configuration, and combination of different methods.
We show that with proper settings, classical solutions may still outperform the perceived state of the art.
arXiv Detail & Related papers (2020-03-03T15:20:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.