Ranking and benchmarking framework for sampling algorithms on synthetic data streams
- URL: http://arxiv.org/abs/2006.09895v1
- Date: Wed, 17 Jun 2020 14:25:07 GMT
- Title: Ranking and benchmarking framework for sampling algorithms on synthetic data streams
- Authors: József Dániel Gáspár, Martin Horváth, Győző Horváth and Zoltán Zvara
- Abstract summary: In big data, AI, and stream processing, we work with large amounts of data from multiple sources.
Due to memory and network limitations, we process data streams on distributed systems to alleviate computational and network loads.
We provide algorithms that react to concept drifts and compare those against the state-of-the-art algorithms using our framework.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the fields of big data, AI, and stream processing, we work with large
amounts of data from multiple sources. Due to memory and network limitations,
we process data streams on distributed systems to alleviate computational and
network loads. When data streams with non-uniform distributions are processed,
we often observe overloaded partitions due to the use of simple hash
partitioning. To tackle this imbalance, we can use dynamic partitioning
algorithms that require a sampling algorithm to precisely estimate the
underlying distribution of the data stream. There is no standardized way to
test these algorithms. We offer an extensible ranking framework with benchmark
and hyperparameter optimization capabilities and supply our framework with a
data generator that can handle concept drifts.
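To make the sampling-to-partitioning loop concrete, here is a minimal sketch, assuming illustrative names (`SpaceSaving`, `DynamicPartitioner`) and a generic Space-Saving estimator rather than the paper's actual algorithms: the estimator tracks approximate key frequencies, and the partitioner pins the estimated heavy keys while hashing the long tail.

```python
# Minimal sketch (illustrative, not the paper's implementation):
# a Space-Saving frequency estimator drives a dynamic partitioner.
from collections import Counter

class SpaceSaving:
    """Approximate heavy-hitter counts with a fixed memory budget."""
    def __init__(self, capacity=100):
        self.capacity = capacity
        self.counts = Counter()

    def offer(self, key):
        if key in self.counts or len(self.counts) < self.capacity:
            self.counts[key] += 1
        else:
            # Evict the minimum counter and let the new key inherit it.
            victim, min_count = min(self.counts.items(), key=lambda kv: kv[1])
            del self.counts[victim]
            self.counts[key] = min_count + 1

class DynamicPartitioner:
    """Pin estimated heavy keys to partitions; hash the long tail."""
    def __init__(self, num_partitions, sampler):
        self.num_partitions = num_partitions
        self.sampler = sampler
        self.pinned = {}

    def partition(self, key):
        self.sampler.offer(key)
        return self.pinned.get(key, hash(key) % self.num_partitions)

    def rebalance(self, top=8):
        # Re-pin the currently heaviest keys round-robin across partitions.
        heavy = [k for k, _ in self.sampler.counts.most_common(top)]
        self.pinned = {k: i % self.num_partitions for i, k in enumerate(heavy)}
```

Calling `rebalance()` periodically is what lets the routing track a drifting key distribution; the framework's job is to measure how precisely different samplers feed this step.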
Our work includes a generator for dynamic micro-bursts that we can apply to
any data stream. We provide algorithms that react to concept drifts and compare
those against the state-of-the-art algorithms using our framework.
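The abstract does not specify the burst mechanics, so the following is only a plausible sketch (the wrapper name and parameters are assumptions): a generator that occasionally repeats the current key, creating a transient hot spot on top of any base stream.

```python
import random

def with_micro_bursts(stream, burst_prob=0.01, burst_len=50):
    """Overlay micro-bursts on any key stream: with probability
    burst_prob, re-emit the current key burst_len times in a row."""
    for key in stream:
        yield key
        if random.random() < burst_prob:
            for _ in range(burst_len):
                yield key

# Usage: a uniform base stream with occasional hot-key bursts.
base = (random.randint(0, 999) for _ in range(10_000))
bursty_stream = with_micro_bursts(base)
```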
Related papers
- A Mirror Descent-Based Algorithm for Corruption-Tolerant Distributed Gradient Descent [57.64826450787237]
We show how to analyze the behavior of distributed gradient descent algorithms in the presence of adversarial corruptions.
We show how to use ideas from (lazy) mirror descent to design a corruption-tolerant distributed optimization algorithm.
Experiments based on linear regression, support vector classification, and softmax classification on the MNIST dataset corroborate our theoretical findings.
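For orientation, the textbook lazy mirror descent (dual averaging) update with mirror map $\phi$, which such corruption-tolerant schemes build on (the paper's robust modifications are not reproduced here):

```latex
% Lazy mirror descent: accumulate gradients in the dual space,
% then map back to the primal through \nabla\phi^{*}.
\[
  z_{t+1} = z_t - \eta\, g_t, \qquad
  x_{t+1} = \nabla\phi^{*}(z_{t+1})
          = \arg\min_{x}\ \bigl\{\, \phi(x) - \langle z_{t+1},\, x \rangle \,\bigr\}.
\]
```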
arXiv Detail & Related papers (2024-07-19T08:29:12Z)
- An Algorithm for Streaming Differentially Private Data [7.726042106665366]
We derive an algorithm for differentially private synthetic streaming data generation, especially curated towards spatial datasets.
The utility of our algorithm is verified on both real-world and simulated datasets.
arXiv Detail & Related papers (2024-01-26T00:32:31Z)
- Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed an RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
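The NNGP-kernel features are the paper's specific construction; the generic random feature trick it builds on, shown here for the familiar RBF kernel (a sketch, not the RFAD code), maps data so inner products approximate kernel values:

```python
import numpy as np

def random_fourier_features(X, num_features=1024, gamma=1.0, seed=0):
    """Random Fourier features (Rahimi & Recht, 2007): z(x) @ z(y)
    approximates the RBF kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

# Kernel methods can then run on the explicit map in O(n * num_features)
# time and memory instead of materializing an O(n^2) kernel matrix.
```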
arXiv Detail & Related papers (2022-10-21T15:56:13Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments empirically compare PARTIME with classic non-parallel neural computations in online learning.
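PARTIME's actual API is not shown in the abstract; as a generic illustration of the sample-as-it-arrives pattern (plain Python threads, not the library's machinery), a producer/consumer sketch:

```python
import queue
import threading

def consume(q, model_step):
    """Apply one learning step to each sample as soon as it arrives."""
    while True:
        sample = q.get()
        if sample is None:       # sentinel: the stream has ended
            break
        model_step(sample)       # e.g., one online-learning update

stream_queue = queue.Queue(maxsize=64)
worker = threading.Thread(target=consume, args=(stream_queue, print))
worker.start()
for x in range(5):               # stand-in for a real data stream
    stream_queue.put(x)
stream_queue.put(None)
worker.join()
```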
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- SreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of kernel ridge regression (KRR) require that all the data be stored in main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
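StreaMRAK's multi-resolution construction is more elaborate; the baseline problem it addresses can be seen in a bare-bones budgeted variant (illustrative only): keep a bounded reservoir of points so memory stays constant, and solve the KRR system only on what is retained.

```python
import numpy as np

def krr_fit(X, y, lam=1e-3, gamma=1.0):
    """Solve (K + lam*I) alpha = y for an RBF kernel on the retained set."""
    K = np.exp(-gamma * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

class BudgetedKRR:
    """Bound memory on an unbounded stream via reservoir sampling."""
    def __init__(self, budget=256):
        self.budget, self.seen = budget, 0
        self.X, self.y = [], []

    def partial_fit(self, x, target):
        self.seen += 1
        if len(self.X) < self.budget:
            self.X.append(x); self.y.append(target)
        else:
            j = np.random.randint(self.seen)   # reservoir sampling
            if j < self.budget:
                self.X[j], self.y[j] = x, target

    def solve(self):
        return krr_fit(np.array(self.X), np.array(self.y))
```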
arXiv Detail & Related papers (2021-08-23T21:03:09Z)
- Imbalanced Big Data Oversampling: Taxonomy, Algorithms, Software, Guidelines and Future Directions [6.436899373275926]
We take a holistic look at oversampling algorithms for imbalanced big data.
We introduce a Spark library with 14 state-of-the-art oversampling algorithms.
We evaluate the trade-off between accuracy and time complexity of oversampling algorithms.
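The library itself targets Spark; the interpolation idea at the heart of most of the surveyed algorithms is SMOTE, which in plain NumPy (a sketch assuming more than k minority samples) reduces to:

```python
import numpy as np

def smote_like(X_minority, n_new, k=5, seed=0):
    """Synthesize minority samples by interpolating each chosen point
    toward one of its k nearest minority-class neighbours (SMOTE)."""
    rng = np.random.default_rng(seed)
    n = len(X_minority)
    d2 = np.sum((X_minority[:, None] - X_minority[None, :]) ** 2, axis=-1)
    neighbours = np.argsort(d2, axis=1)[:, 1:k + 1]   # drop self-match
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        j = rng.choice(neighbours[i])
        step = rng.random()                            # in [0, 1)
        synthetic.append(X_minority[i] + step * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)
```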
arXiv Detail & Related papers (2021-07-24T01:49:46Z)
- Providing Meaningful Data Summarizations Using Examplar-based Clustering in Industry 4.0 [67.80123919697971]
We show that our GPU implementation provides speedups of up to 72x using single-precision and up to 452x using half-precision compared to conventional CPU algorithms.
We apply our algorithm to real-world data from injection molding manufacturing processes and discuss how found summaries help with steering this specific process to cut costs and reduce the manufacturing of bad parts.
arXiv Detail & Related papers (2021-05-25T15:55:14Z)
- Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge Computing [113.52575069030192]
Big data, including applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
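The coded, stochastic mini-batch variant is the paper's contribution; for reference, the standard consensus ADMM iteration it builds on, for $\min_x \sum_{i=1}^{N} f_i(x)$ with local copies $x_i$, consensus variable $z$, and scaled duals $u_i$:

```latex
% One round of standard consensus ADMM with penalty parameter \rho:
\begin{align*}
  x_i^{k+1} &= \arg\min_{x_i}\; f_i(x_i)
               + \tfrac{\rho}{2}\bigl\|x_i - z^k + u_i^k\bigr\|_2^2,\\
  z^{k+1}   &= \tfrac{1}{N}\textstyle\sum_{i=1}^{N}\bigl(x_i^{k+1} + u_i^k\bigr),\\
  u_i^{k+1} &= u_i^k + x_i^{k+1} - z^{k+1}.
\end{align*}
```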
arXiv Detail & Related papers (2020-10-02T10:41:59Z)
- Frequency Estimation in Data Streams: Learning the Optimal Hashing Scheme [3.7565501074323224]
We present a novel approach for the problem of frequency estimation in data streams that is based on optimization and machine learning.
The proposed approach exploits an observed stream prefix to near-optimally hash elements and compress the target frequency distribution.
We show that the proposed approach outperforms existing approaches by one to two orders of magnitude in terms of its average (per element) estimation error and by 45-90% in terms of its expected magnitude of estimation error.
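The optimal hashing scheme itself is the paper's contribution; a common baseline for this problem is the Count-Min sketch, and one well-known way to add learning (cf. Hsu et al., 2019, not necessarily this paper's scheme) is to route predicted heavy hitters to exact counters:

```python
import numpy as np

class CountMin:
    """Count-Min sketch: estimates are biased upward, never downward."""
    def __init__(self, width=2048, depth=4, seed=0):
        rng = np.random.default_rng(seed)
        self.table = np.zeros((depth, width), dtype=np.int64)
        self.salts = [int(s) for s in rng.integers(1, 2**31, size=depth)]
        self.width = width

    def _cells(self, key):
        return [(row, hash((salt, key)) % self.width)
                for row, salt in enumerate(self.salts)]

    def add(self, key, count=1):
        for row, col in self._cells(key):
            self.table[row, col] += count

    def estimate(self, key):
        return min(self.table[row, col] for row, col in self._cells(key))

class LearnedCountMin:
    """Count predicted heavy hitters exactly; sketch everything else."""
    def __init__(self, is_heavy, **kwargs):
        self.is_heavy, self.exact, self.cm = is_heavy, {}, CountMin(**kwargs)

    def add(self, key):
        if self.is_heavy(key):
            self.exact[key] = self.exact.get(key, 0) + 1
        else:
            self.cm.add(key)

    def estimate(self, key):
        return self.exact.get(key, 0) if self.is_heavy(key) else self.cm.estimate(key)
```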
arXiv Detail & Related papers (2020-07-17T22:15:22Z)
- Scaling-up Distributed Processing of Data Streams for Machine Learning [10.581140430698103]
This paper reviews recently developed methods that focus on large-scale distributed optimization in the compute- and bandwidth-limited regime.
It focuses on methods that solve: (i) distributed convex problems, and (ii) distributed principal component analysis, which is a nonconvex problem with geometric structure that permits global convergence.
arXiv Detail & Related papers (2020-05-18T16:28:54Z)
- How to Solve Fair $k$-Center in Massive Data Models [5.3283669037198615]
We design new streaming and distributed algorithms for the fair $k$-center problem.
Our main contributions are: (a) the first distributed algorithm; and (b) a two-pass streaming algorithm with a provable approximation guarantee.
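For contrast, the classic unconstrained 2-approximation (Gonzalez's farthest-first traversal) fits in a few lines when all data is in memory; the paper's point is achieving a comparable guarantee under fairness constraints in streaming and distributed models:

```python
import numpy as np

def greedy_k_center(X, k, seed=0):
    """Gonzalez's farthest-first traversal: a 2-approximation for
    (unconstrained) k-center when all points fit in memory."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]
    dist = np.linalg.norm(X - X[centers[0]], axis=1)
    while len(centers) < k:
        nxt = int(np.argmax(dist))                # farthest remaining point
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return centers                                # indices of chosen centers
```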
arXiv Detail & Related papers (2020-02-18T16:11:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.