Related papers: Data Stream Clustering: A Review

URL: http://arxiv.org/abs/2007.10781v1
Date: Thu, 16 Jul 2020 20:35:09 GMT
Title: Data Stream Clustering: A Review
Authors: Alaettin Zubaroğlu and Volkan Atalay
Abstract summary: Clustering is one of the most suitable methods for real-time data stream processing. We review recent data stream clustering algorithms and analyze them in terms of the base clustering technique, computational complexity and clustering accuracy. We indicate popular data stream repositories and datasets, stream processing tools and platforms.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The number of connected devices is steadily increasing and these devices continuously generate data streams. Real-time processing of data streams is arousing interest despite many challenges. Clustering is one of the most suitable methods for real-time data stream processing, because it can be applied with less prior information about the data and it does not need labeled instances. However, data stream clustering differs from traditional clustering in many aspects and it has several challenging issues. Here, we provide information regarding the concepts and common characteristics of data streams, such as concept drift, data structures for data streams, time window models and outlier detection. We comprehensively review recent data stream clustering algorithms and analyze them in terms of the base clustering technique, computational complexity and clustering accuracy. A comparison of these algorithms is given along with still open problems. We indicate popular data stream repositories and datasets, stream processing tools and platforms. Open problems about data stream clustering are also discussed.

Related papers

TNStream: Applying Tightest Neighbors to Micro-Clusters to Define Multi-Density Clusters in Streaming Data [1.2016321065590192]
This paper proposes a clustering algorithm based on the novel concept of Tightest Neighbors and introduces a data stream clustering theory based on the Skeleton Set. Based on these theories, this paper develops a new method, TNStream, a fully online algorithm. Experimental results demonstrate its effectiveness in improving clustering quality for multi-density data and validate the proposed data stream clustering theory.
arXiv Detail & Related papers (2025-05-01T07:15:20Z)
Incremental Gaussian Mixture Clustering for Data Streams [0.08192907805418582]
We present an algorithm that finds clusters and anomalous data points in streaming datasets, and demonstrate that it works effectively. As the clusters are formed, we also identify anomalous data points that show up far away from all known clusters.
arXiv Detail & Related papers (2024-12-10T06:15:14Z)
DREW : Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking [58.37644304554906]
We propose Data Retrieval with Error-corrected codes and Watermarking (DREW). DREW randomly clusters the reference dataset and injects unique error-controlled watermark keys into each cluster. After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches.
arXiv Detail & Related papers (2024-06-05T01:19:44Z)
An Algorithm for Streaming Differentially Private Data [7.726042106665366]
We derive an algorithm for differentially private synthetic streaming data generation, especially curated towards spatial datasets. The utility of our algorithm is verified on both real-world and simulated datasets.
arXiv Detail & Related papers (2024-01-26T00:32:31Z)
Contrastive Continual Multi-view Clustering with Filtered Structural Fusion [57.193645780552565]
Multi-view clustering thrives in applications where views are collected in advance. It overlooks scenarios where data views are collected sequentially, i.e., real-time data. Some methods are proposed to handle it but are trapped in a stability-plasticity dilemma. We propose Contrastive Continual Multi-view Clustering with Filtered Structural Fusion.
arXiv Detail & Related papers (2023-09-26T14:18:29Z)
Hard Regularization to Prevent Deep Online Clustering Collapse without Data Augmentation [65.268245109828]
Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed. While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster. We propose a method that does not require data augmentation, and that, differently from existing methods, regularizes the hard assignments.
arXiv Detail & Related papers (2023-03-29T08:23:26Z)
DC-BENCH: Dataset Condensation Benchmark [79.18718490863908]
This work provides the first large-scale standardized benchmark on dataset condensation. It consists of a suite of evaluations to comprehensively reflect the generability and effectiveness of condensation methods. The benchmark library is open-sourced to facilitate future research and application.
arXiv Detail & Related papers (2022-07-20T03:54:05Z)
Improved Multi-objective Data Stream Clustering with Time and Memory Optimization [0.0]
This paper introduces a new data stream clustering method (IMOC-Stream). It uses two different objective functions to capture different aspects of the data. The experiments show the ability of our method to partition the data stream in arbitrarily shaped, compact, and well-separated clusters.
arXiv Detail & Related papers (2022-01-13T17:05:56Z)
Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points. We provide implementable differentially private clustering algorithms that provide utility when the data is "easy". We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z)
A Clustering-based Framework for Classifying Data Streams [0.6524460254566904]
We propose a clustering-based data stream classification framework to handle non-stationary data streams. The proposed method provides statistically better or comparable performance than the existing methods.
arXiv Detail & Related papers (2021-06-22T14:37:52Z)
Scaling-up Distributed Processing of Data Streams for Machine Learning [10.581140430698103]
This paper reviews recently developed methods that focus on large-scale distributed optimization in the compute- and bandwidth-limited regime. It focuses on methods that solve: (i) distributed convex problems, and (ii) distributed principal component analysis, which is a nonconvex problem with geometric structure that permits global convergence.
arXiv Detail & Related papers (2020-05-18T16:28:54Z)
A Novel Incremental Clustering Technique with Concept Drift Detection [2.790947019327459]
Traditional static clustering algorithms are not suitable for dynamic datasets. We propose an efficient incremental clustering algorithm called UIClust. We evaluate the performance of UIClust by comparing it with a recently published, high-quality incremental clustering algorithm.
arXiv Detail & Related papers (2020-03-30T05:20:35Z)
Distributed Learning in the Non-Convex World: From Batch to Streaming Data, and Beyond [73.03743482037378]
Distributed learning has become a critical direction of the massively connected world envisioned by many. This article discusses four key elements of scalable distributed processing and real-time data computation problems. Practical issues and future research will also be discussed.
arXiv Detail & Related papers (2020-01-14T14:11:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.