Challenges in Benchmarking Stream Learning Algorithms with Real-world Data
- URL: http://arxiv.org/abs/2005.00113v2
- Date: Tue, 30 Jun 2020 15:41:10 GMT
- Title: Challenges in Benchmarking Stream Learning Algorithms with Real-world Data
- Authors: Vinicius M. A. Souza, Denis M. dos Reis, Andre G. Maletzke, Gustavo E. A. P. A. Batista
- Abstract summary: Streaming data are increasingly present in real-world applications such as sensor measurements, satellite data feed, stock market, and financial data.
The data stream mining community still faces some primary challenges and difficulties related to the comparison and evaluation of new proposals.
We propose a new public data repository for benchmarking stream algorithms with real-world data.
- Score: 2.861782696432711
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Streaming data are increasingly present in real-world applications such as
sensor measurements, satellite data feed, stock market, and financial data. The
main characteristics of these applications are the online arrival of data
observations at high speed and the susceptibility to changes in the data
distributions due to the dynamic nature of real environments. The data stream
mining community still faces some primary challenges and difficulties related
to the comparison and evaluation of new proposals, mainly due to the lack of
publicly available non-stationary real-world datasets. The comparison of stream
algorithms proposed in the literature is not an easy task, as authors do not
always follow the same recommendations, experimental evaluation procedures,
datasets, and assumptions. In this paper, we mitigate problems related to the
choice of datasets in the experimental evaluation of stream classifiers and
drift detectors. To that end, we propose a new public data repository for
benchmarking stream algorithms with real-world data. This repository contains
the most popular datasets from literature and new datasets related to a highly
relevant public health problem that involves the recognition of disease vector
insects using optical sensors. The main advantage of these new datasets is the
prior knowledge of their characteristics and patterns of changes to evaluate
new adaptive algorithm proposals adequately. We also present an in-depth
discussion about the characteristics, reasons, and issues that lead to
different types of changes in data distribution, as well as a critical review
of common problems concerning the current benchmark datasets available in the
literature.
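The evaluation protocol most commonly used for the stream classifiers discussed above is prequential ("test-then-train") evaluation, where each arriving example is first used to test the model and only then to update it. The following is a minimal illustrative sketch, not code from the paper or its repository: the stream generator, the majority-class learner, and the drift point are all invented for the example. It shows why a non-adaptive learner degrades after a distribution change, which is exactly the behavior the proposed benchmark datasets are designed to expose.

```python
import random

def synthetic_stream(n=1000, drift_at=500, seed=42):
    """Yield binary labels whose majority class flips abruptly at `drift_at`."""
    rng = random.Random(seed)
    for t in range(n):
        p_one = 0.9 if t < drift_at else 0.1  # concept flips at the drift point
        yield 1 if rng.random() < p_one else 0

class MajorityClassLearner:
    """Toy learner: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = [0, 0]

    def predict(self):
        return 0 if self.counts[0] >= self.counts[1] else 1

    def learn(self, y):
        self.counts[y] += 1

def prequential_accuracy(stream, learner):
    """Test-then-train: predict each example before learning from it."""
    correct = total = 0
    for y in stream:
        correct += int(learner.predict() == y)  # test first...
        learner.learn(y)                        # ...then train
        total += 1
    return correct / total

# Without drift the majority learner does well; with an abrupt drift at
# t=500 it keeps predicting the old majority class and accuracy collapses
# toward chance, motivating the drift detectors the repository benchmarks.
acc_drift = prequential_accuracy(synthetic_stream(), MajorityClassLearner())
acc_stationary = prequential_accuracy(
    synthetic_stream(drift_at=1000), MajorityClassLearner())
```

A drift-aware variant would pair the learner with a change detector and reset the counts when a change is signaled; the known drift positions in the repository's insect datasets make it possible to check whether such a detector fires at the right moments.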
Related papers
- Online Model-based Anomaly Detection in Multivariate Time Series: Taxonomy, Survey, Research Challenges and Future Directions [0.017476232824732776]
Time-series anomaly detection plays an important role in engineering processes.
This survey introduces a novel taxonomy that distinguishes between online and offline settings, and between training and inference.
It presents the most popular data sets and evaluation metrics used in the literature, as well as a detailed analysis.
arXiv Detail & Related papers (2024-08-07T13:01:10Z)
- OCDB: Revisiting Causal Discovery with a Comprehensive Benchmark and Evaluation Framework [21.87740178652843]
Causal discovery offers a promising approach to improve transparency and reliability.
We propose a flexible evaluation framework with metrics for evaluating differences in causal structures and causal effects.
We introduce the Open Causal Discovery Benchmark (OCDB), based on real data, to promote fair comparisons and drive optimization of algorithms.
arXiv Detail & Related papers (2024-06-07T03:09:22Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting [65.71129509623587]
Road traffic forecasting plays a critical role in smart city initiatives and has experienced significant advancements thanks to the power of deep learning.
However, the promising results achieved on current public datasets may not be applicable to practical scenarios.
We introduce the LargeST benchmark dataset, which includes a total of 8,600 sensors in California with a 5-year time coverage.
arXiv Detail & Related papers (2023-06-14T05:48:36Z)
- Quality In / Quality Out: Assessing Data quality in an Anomaly Detection Benchmark [0.13764085113103217]
We show that relatively minor modifications on the same benchmark dataset (UGR'16, a flow-based real-traffic dataset for anomaly detection) cause significantly more impact on model performance than the specific Machine Learning technique considered.
Our findings illustrate the need to devote more attention into (automatic) data quality assessment and optimization techniques in the context of autonomous networks.
arXiv Detail & Related papers (2023-05-31T12:03:12Z)
- A Survey of Dataset Refinement for Problems in Computer Vision Datasets [11.45536223418548]
Large-scale datasets have played a crucial role in the advancement of computer vision.
They often suffer from problems such as class imbalance, noisy labels, dataset bias, or high resource costs.
Various data-centric solutions have been proposed to solve the dataset problems.
They improve the quality of datasets by re-organizing them, which we call dataset refinement.
arXiv Detail & Related papers (2022-10-21T03:58:43Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- DC-BENCH: Dataset Condensation Benchmark [79.18718490863908]
This work provides the first large-scale standardized benchmark on dataset condensation.
It consists of a suite of evaluations to comprehensively reflect the generalizability and effectiveness of condensation methods.
The benchmark library is open-sourced to facilitate future research and application.
arXiv Detail & Related papers (2022-07-20T03:54:05Z)
- Domain Adaptative Causality Encoder [52.779274858332656]
We leverage the characteristics of dependency trees and adversarial learning to address the tasks of adaptive causality identification and localisation.
We present a new causality dataset, namely MedCaus, which integrates all types of causality in the text.
arXiv Detail & Related papers (2020-11-27T04:14:55Z)
- Comparative Analysis of Extreme Verification Latency Learning Algorithms [3.3439097577935213]
This paper is a comprehensive survey and comparative analysis of EVL algorithms, pointing out the weaknesses and strengths of the different approaches.
This work is a first effort to provide the research community with a review of the existing algorithms in this field.
arXiv Detail & Related papers (2020-11-26T16:34:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.