Improving the performance of bagging ensembles for data streams through
mini-batching
- URL: http://arxiv.org/abs/2112.09834v1
- Date: Sat, 18 Dec 2021 03:44:07 GMT
- Title: Improving the performance of bagging ensembles for data streams through
mini-batching
- Authors: Guilherme Cassales, Heitor Gomes, Albert Bifet, Bernhard Pfahringer,
Hermes Senger
- Abstract summary: Machine learning applications have to cope with dynamic environments where data are collected in the form of continuous data streams.
Stream processing algorithms have additional requirements regarding computational resources and adaptability to data evolution.
This paper proposes a mini-batching strategy that can improve memory access locality and performance of several ensemble algorithms for stream mining in multi-core environments.
- Score: 9.418151228755834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Often, machine learning applications have to cope with dynamic environments
where data are collected in the form of continuous data streams with
potentially infinite length and transient behavior. Compared to traditional
(batch) data mining, stream processing algorithms have additional requirements
regarding computational resources and adaptability to data evolution. They must
process instances incrementally because the data's continuous flow prohibits
storing data for multiple passes. Ensemble learning has achieved remarkable
predictive performance in this scenario. Because ensembles are implemented as
sets of individual classifiers, they are naturally amenable to task parallelism.
However, incremental learning and the dynamic data structures used to capture
concept drift increase cache misses and hinder the benefits of parallelism.
parallelism. This paper proposes a mini-batching strategy that can improve
memory access locality and performance of several ensemble algorithms for
stream mining in multi-core environments. With the aid of a formal framework,
we demonstrate that mini-batching can significantly decrease the reuse distance
(and the number of cache misses). Experiments with six state-of-the-art
ensemble algorithms on four benchmark datasets with varied characteristics
show speedups of up to 5X on 8-core processors. These benefits
come at the expense of a small reduction in predictive performance.
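As a rough illustration of the core idea (and not the authors' implementation), the sketch below shows how mini-batching restructures the training loop of an online bagging ensemble: incoming instances are buffered into a mini-batch, and each parallel task then replays the entire mini-batch against a single classifier, so that classifier's working set is touched many times in a row instead of being evicted after every instance. The class and method names (ToyIncrementalClassifier, MiniBatchBaggingEnsemble, learn_one, predict_one) are hypothetical stand-ins, and the Poisson(1) instance weighting follows the usual online-bagging recipe rather than any specific ensemble studied in the paper.

```python
# Minimal, hypothetical sketch of mini-batching in an online bagging ensemble.
# Not the authors' implementation; it only illustrates the access pattern:
# each task replays the whole mini-batch against ONE classifier, so that
# classifier's state is reused batch_size times in a row instead of being
# evicted after every single instance.
import math
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


class ToyIncrementalClassifier:
    """Stand-in for an incremental learner (e.g. a Hoeffding tree):
    a crude nearest-centroid model updated one instance at a time."""

    def __init__(self):
        self.sums = {}            # class label -> per-feature running sums
        self.counts = Counter()   # class label -> instances seen

    def learn_one(self, x, y):
        sums = self.sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            sums[i] += v
        self.counts[y] += 1

    def predict_one(self, x):
        best, best_dist = None, float("inf")
        for y, sums in self.sums.items():
            centroid = [s / self.counts[y] for s in sums]
            dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
            if dist < best_dist:
                best, best_dist = y, dist
        return best


class MiniBatchBaggingEnsemble:
    """Online bagging (Poisson(1) instance weighting) with mini-batching and
    task parallelism across ensemble members."""

    def __init__(self, n_models=8, batch_size=50, seed=42):
        self.models = [ToyIncrementalClassifier() for _ in range(n_models)]
        self.batch_size = batch_size
        self.buffer = []
        self.rng = random.Random(seed)   # shared RNG is fine for a sketch
        self.pool = ThreadPoolExecutor(max_workers=n_models)

    def _poisson1(self):
        # Knuth's method for sampling Poisson(lambda = 1).
        threshold, k, p = math.exp(-1.0), 0, 1.0
        while p > threshold:
            k += 1
            p *= self.rng.random()
        return k - 1

    def _train_model(self, model, batch):
        # One task = one classifier consuming the whole mini-batch, which is
        # what keeps consecutive memory accesses on the same model state.
        for x, y in batch:
            for _ in range(self._poisson1()):
                model.learn_one(x, y)

    def learn_one(self, x, y):
        self.buffer.append((x, y))
        if len(self.buffer) >= self.batch_size:
            batch, self.buffer = self.buffer, []
            # CPython's GIL prevents real thread parallelism here; the threads
            # only show how the work would be partitioned across cores.
            futures = [self.pool.submit(self._train_model, m, batch)
                       for m in self.models]
            for f in futures:
                f.result()

    def predict_one(self, x):
        votes = Counter(m.predict_one(x) for m in self.models if m.counts)
        return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    # Tiny synthetic stream, evaluated prequentially (test-then-train).
    rng = random.Random(0)
    ens = MiniBatchBaggingEnsemble(n_models=4, batch_size=25)
    correct = total = 0
    for _ in range(2000):
        y = rng.choice([0, 1])
        x = [rng.gauss(3.0 * y, 1.0), rng.gauss(-3.0 * y, 1.0)]
        correct += int(ens.predict_one(x) == y)
        total += 1
        ens.learn_one(x, y)
    print(f"prequential accuracy: {correct / total:.3f}")
```

A plausible reading of the accuracy trade-off mentioned in the abstract is that each classifier's updates are delayed by up to batch_size - 1 instances, so predictions inside a window are made with slightly stale models; larger mini-batches improve locality and throughput but widen that staleness window.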
Related papers
- Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods while using far fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Adaptive Cross Batch Normalization for Metric Learning [75.91093210956116]
Metric learning is a fundamental problem in computer vision.
We show that it is equally important to ensure that the accumulated embeddings are up to date.
In particular, it is necessary to circumvent the representational drift between the accumulated embeddings and the feature embeddings at the current training iteration.
arXiv Detail & Related papers (2023-03-30T03:22:52Z) - Less is More: Reducing Task and Model Complexity for 3D Point Cloud
Semantic Segmentation [26.94284739177754]
New pipeline requires fewer ground-truth annotations to achieve superior segmentation accuracy.
New Sparse Depthwise Separable Convolution module significantly reduces the network parameter count.
New Spatio-Temporal Redundant Frame Downsampling (ST-RFD) method extracts a more diverse subset of training data frame samples.
arXiv Detail & Related papers (2023-03-20T15:36:10Z) - Performance Embeddings: A Similarity-based Approach to Automatic
Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications.
We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z) - PARTIME: Scalable and Parallel Processing Over Time with Deep Neural
Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z) - Improved Multi-objective Data Stream Clustering with Time and Memory
Optimization [0.0]
This paper introduces a new data stream clustering method (IMOC-Stream).
It uses two different objective functions to capture different aspects of the data.
The experiments show the ability of our method to partition the data stream in arbitrarily shaped, compact, and well-separated clusters.
arXiv Detail & Related papers (2022-01-13T17:05:56Z) - Parallel Actors and Learners: A Framework for Generating Scalable RL
Implementations [14.432131909590824]
Reinforcement Learning (RL) has achieved significant success in application domains such as robotics, games, health care and others.
Current implementations exhibit poor performance due to challenges such as irregular memory accesses and synchronization overheads.
We propose a framework for generating scalable reinforcement learning implementations on multicore systems.
arXiv Detail & Related papers (2021-10-03T21:00:53Z) - StreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of KRR require that all the data is stored in the main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
arXiv Detail & Related papers (2021-08-23T21:03:09Z) - Sparse Convolutions on Continuous Domains for Point Cloud and Event
Stream Networks [14.664758777845572]
We present an elegant sparse matrix-based interpretation of the convolution operator for unstructured continuous data like point clouds and event streams.
We demonstrate networks built with these operations can train an order of magnitude or more faster than top existing methods.
We also apply our operator to event stream processing, achieving state-of-the-art results on multiple tasks with streams of hundreds of thousands of events.
arXiv Detail & Related papers (2020-12-02T13:05:02Z) - Stochastic Optimization with Laggard Data Pipelines [65.20044914532221]
We show that "data-echoed" extensions of common optimization methods exhibit provable improvements over their synchronous counterparts.
Specifically, we show that in convex optimization with minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
arXiv Detail & Related papers (2020-10-26T14:55:31Z) - Ranking and benchmarking framework for sampling algorithms on synthetic
data streams [0.0]
In big data, AI, and stream processing, we work with large amounts of data from multiple sources.
Due to memory and network limitations, we process data streams on distributed systems to alleviate computational and network loads.
We provide algorithms that react to concept drifts and compare those against the state-of-the-art algorithms using our framework.
arXiv Detail & Related papers (2020-06-17T14:25:07Z)