Related papers: Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth

Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth

URL: http://arxiv.org/abs/2412.03611v2
Date: Wed, 18 Dec 2024 07:01:15 GMT
Title: Learning-based Sketches for Frequency Estimation in Data Streams without Ground Truth
Authors: Xinyu Yuan, Yan Qiao, Meng Li, Zhenchun Wei, Cuiying Feng,
Abstract summary: We propose a more practical learning-based estimation framework namely UCL-sketch.<n>Online training via equivalent learning without ground truth, and highly scalable architecture with logical estimation buckets.<n>Results demonstrate that our method greatly outperforms existing state-of-the-art sketches regarding per-key accuracy and distribution.
Score: 8.643366221221351
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Estimating the frequency of items on the high-volume, fast data stream has been extensively studied in many areas, such as database and network measurement. Traditional sketch algorithms only allow to give very rough estimates with limited memory cost, whereas some learning-augmented algorithms have been proposed recently, their offline framework requires actual frequencies that are challenging to access in general for training, and speed is too slow for real-time processing, despite the still coarse-grained accuracy. To this end, we propose a more practical learning-based estimation framework namely UCL-sketch, by following the line of equation-based sketch to estimate per-key frequencies. In a nutshell, there are two key techniques: online training via equivalent learning without ground truth, and highly scalable architecture with logical estimation buckets. We implemented experiments on both real-world and synthetic datasets. The results demonstrate that our method greatly outperforms existing state-of-the-art sketches regarding per-key accuracy and distribution, while preserving resource efficiency. Our code is attached in the supplementary material, and will be made publicly available at https://github.com/Y-debug-sys/UCL-sketch.

Related papers

PROL : Rehearsal Free Continual Learning in Streaming Data via Prompt Online Learning [17.230781041043823]
We propose a novel prompt-based method for online continual learning (OCL) that includes 4 main components.<n>Our proposed method achieves significantly higher performance than the current SOTAs in CIFAR100, ImageNet-R, ImageNet-A, and CUB dataset.
arXiv Detail & Related papers (2025-07-16T15:04:46Z)
SONNET: Enhancing Time Delay Estimation by Leveraging Simulated Audio [17.811771707446926]
We show that learning based methods can, even based on synthetic data, significantly outperform GCC-PHAT on novel real world data. We provide our trained model, SONNET, which is runnable in real-time and works on novel data out of the box for many real data applications.
arXiv Detail & Related papers (2024-11-20T10:23:21Z)
Learning-Based Heavy Hitters and Flow Frequency Estimation in Streams [9.22255012731159]
We present the first learned competing-counter-based algorithm, called LSS, for identifying heavy hitters, top k, and flow frequency estimation. Our evaluation demonstrates that LSS can enhance the accuracy and efficiency of Space Saving in identifying heavy hitters, top k, and estimating flow frequencies.
arXiv Detail & Related papers (2024-06-24T02:31:00Z)
Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution [62.71425232332837]
We show that training amortized models with noisy labels is inexpensive and surprisingly effective. This approach significantly accelerates several feature attribution and data valuation methods, often yielding an order of magnitude speedup over existing approaches.
arXiv Detail & Related papers (2024-01-29T03:42:37Z)
Benchmarking Neural Network Training Algorithms [52.890134877995195]
Training algorithms are an essential part of every deep learning pipeline.<n>As a community, we are unable to reliably identify training algorithm improvements.<n>We introduce a new, competitive, time-to-result benchmark using multiple workloads running on fixed hardware.
arXiv Detail & Related papers (2023-06-12T15:21:02Z)
Learnability with Time-Sharing Computational Resource Concerns [65.268245109828]
We present a theoretical framework that takes into account the influence of computational resources in learning theory. This framework can be naturally applied to stream learning where the incoming data streams can be potentially endless. It may also provide a theoretical perspective for the design of intelligent supercomputing operating systems.
arXiv Detail & Related papers (2023-05-03T15:54:23Z)
Real-Time Evaluation in Online Continual Learning: A New Hope [104.53052316526546]
We evaluate current Continual Learning (CL) methods with respect to their computational costs. A simple baseline outperforms state-of-the-art CL methods under this evaluation. This surprisingly suggests that the majority of existing CL literature is tailored to a specific class of streams that is not practical.
arXiv Detail & Related papers (2023-02-02T12:21:10Z)
Learning with Neighbor Consistency for Noisy Labels [69.83857578836769]
We present a method for learning from noisy labels that leverages similarities between training examples in feature space. We evaluate our method on datasets evaluating both synthetic (CIFAR-10, CIFAR-100) and realistic (mini-WebVision, Clothing1M, mini-ImageNet-Red) noise.
arXiv Detail & Related papers (2022-02-04T15:46:27Z)
Accelerating Deep Learning with Dynamic Data Pruning [0.0]
Deep learning has become prohibitively costly, requiring access to powerful computing systems to train state-of-the-art networks. Previous work, such as forget scores and GraNd/EL2N scores, identify important samples within a full dataset and pruning the remaining samples, thereby reducing the iterations per epoch. We propose two algorithms, based on reinforcement learning techniques, to dynamically prune samples and achieve even higher accuracy than the random dynamic method.
arXiv Detail & Related papers (2021-11-24T16:47:34Z)
Convolutional Sparse Coding Fast Approximation with Application to Seismic Reflectivity Estimation [9.005280130480308]
We propose a speed-up upgraded version of the classic iterative thresholding algorithm, that produces a good approximation of the convolutional sparse code within 2-5 iterations. The performance of the proposed solution is demonstrated via the seismic inversion problem in both synthetic and real data scenarios.
arXiv Detail & Related papers (2021-06-29T12:19:07Z)
Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network. Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
arXiv Detail & Related papers (2020-10-19T13:05:08Z)
Low-Rank Robust Online Distance/Similarity Learning based on the Rescaled Hinge Loss [0.34376560669160383]
Existing online methods usually assume training triplets or pairwise constraints are exist in advance. We formulate the online Distance-Similarity learning problem with the robust Rescaled hinge loss function. The proposed model is rather general and can be applied to any PA-based online Distance-Similarity algorithm.
arXiv Detail & Related papers (2020-10-07T08:38:34Z)
Temporal Calibrated Regularization for Robust Noisy Label Learning [60.90967240168525]
Deep neural networks (DNNs) exhibit great success on many tasks with the help of large-scale well annotated datasets. However, labeling large-scale data can be very costly and error-prone so that it is difficult to guarantee the annotation quality. We propose a Temporal Calibrated Regularization (TCR) in which we utilize the original labels and the predictions in the previous epoch together.
arXiv Detail & Related papers (2020-07-01T04:48:49Z)
On Coresets for Support Vector Machines [61.928187390362176]
A coreset is a small, representative subset of the original data points. We show that our algorithm can be used to extend the applicability of any off-the-shelf SVM solver to streaming, distributed, and dynamic data settings.
arXiv Detail & Related papers (2020-02-15T23:25:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.