A Clustering-based Framework for Classifying Data Streams
- URL: http://arxiv.org/abs/2106.11823v1
- Date: Tue, 22 Jun 2021 14:37:52 GMT
- Title: A Clustering-based Framework for Classifying Data Streams
- Authors: Xuyang Yan, Abdollah Homaifar, Mrinmoy Sarkar, Abenezer Girma, and
Edward Tunstel
- Abstract summary: We propose a clustering-based data stream classification framework to handle non-stationary data streams.
The proposed method provides statistically better or comparable performance than the existing methods.
- Score: 0.6524460254566904
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The non-stationary nature of data streams strongly challenges traditional
machine learning techniques. Although some solutions have been proposed to
extend traditional machine learning techniques for handling data streams, these
approaches either require an initial label set or rely on specialized design
parameters. The overlap among classes and the labeling of data streams
constitute other major challenges for classifying data streams. In this paper,
we propose a clustering-based data stream classification framework to handle
non-stationary data streams without requiring an initial label set. A
density-based stream clustering procedure is used to capture novel concepts
with a dynamic threshold and an effective active label querying strategy is
introduced to continuously learn the new concepts from the data streams. The
sub-cluster structure of each cluster is explored to handle the overlap among
classes. Experimental results and quantitative comparison studies reveal that
the proposed method provides statistically better or comparable performance
than the existing methods.
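The pipeline described in the abstract (density-based micro-clusters, a dynamic distance threshold for detecting novel concepts, and a budgeted active label query) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation; all class names, the threshold schedule, and the query budget are hypothetical:

```python
import numpy as np

class MicroCluster:
    """Summary statistics of one micro-cluster (hypothetical structure)."""
    def __init__(self, x):
        self.n = 1
        self.linear_sum = np.array(x, dtype=float)
        self.label = None  # unknown until an active query resolves it

    @property
    def center(self):
        return self.linear_sum / self.n

    def absorb(self, x):
        self.n += 1
        self.linear_sum += x

class StreamClusterClassifier:
    """Minimal sketch: density-based stream clustering with a dynamic
    distance threshold and a budgeted active label query."""
    def __init__(self, base_threshold=1.0, query_budget=5):
        self.clusters = []
        self.base_threshold = base_threshold
        self.queries_left = query_budget

    def _threshold(self):
        # Dynamic threshold (assumed schedule): tighten as clusters accumulate
        return self.base_threshold / (1 + 0.1 * len(self.clusters))

    def partial_fit(self, x, oracle):
        x = np.asarray(x, dtype=float)
        if self.clusters:
            dists = [np.linalg.norm(x - c.center) for c in self.clusters]
            i = int(np.argmin(dists))
            if dists[i] <= self._threshold():
                # Known concept: absorb and predict the cluster's label
                self.clusters[i].absorb(x)
                return self.clusters[i].label
        # Novel concept: open a new cluster; query its label if budget allows
        c = MicroCluster(x)
        if self.queries_left > 0:
            c.label = oracle(x)
            self.queries_left -= 1
        self.clusters.append(c)
        return c.label
```

A stream with two well-separated concepts then resolves into two labeled clusters after only two oracle queries, one per novel concept.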
Related papers
- A Deterministic Information Bottleneck Method for Clustering Mixed-Type Data [0.0]
We present an information-theoretic method for clustering mixed-type data, that is, data consisting of both continuous and categorical variables.
The method is a variant of the Deterministic Information Bottleneck algorithm which optimally compresses the data while retaining relevant information about the underlying structure.
arXiv Detail & Related papers (2024-07-03T09:06:19Z) - Generalized Category Discovery with Clustering Assignment Consistency [56.92546133591019]
Generalized category discovery (GCD) is a recently proposed open-world task.
We propose a co-training-based framework that encourages clustering consistency.
Our method achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets.
arXiv Detail & Related papers (2023-10-30T00:32:47Z) - Dynamic Conceptional Contrastive Learning for Generalized Category
Discovery [76.82327473338734]
Generalized category discovery (GCD) aims to automatically cluster partially labeled data.
Unlabeled data contain instances that are not only from known categories of the labeled data but also from novel categories.
One effective way for GCD is applying self-supervised learning to learn discriminative representations for unlabeled data.
We propose a Dynamic Conceptional Contrastive Learning framework, which can effectively improve clustering accuracy.
arXiv Detail & Related papers (2023-03-30T14:04:39Z) - Hard Regularization to Prevent Deep Online Clustering Collapse without
Data Augmentation [65.268245109828]
Online deep clustering refers to the joint use of a feature extraction network and a clustering model to assign cluster labels to each new data point or batch as it is processed.
While faster and more versatile than offline methods, online clustering can easily reach the collapsed solution where the encoder maps all inputs to the same point and all are put into a single cluster.
We propose a method that does not require data augmentation and, unlike existing methods, regularizes the hard assignments.
arXiv Detail & Related papers (2023-03-29T08:23:26Z) - Exploiting Diversity of Unlabeled Data for Label-Efficient
Semi-Supervised Active Learning [57.436224561482966]
Active learning is a research area that addresses the issues of expensive labeling by selecting the most important samples for labeling.
We introduce a new diversity-based initial dataset selection algorithm to select the most informative set of samples for initial labeling in the active learning setting.
Also, we propose a novel active learning query strategy, which uses diversity-based sampling on consistency-based embeddings.
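One common way to realize diversity-based sampling of the kind this summary describes is k-center greedy selection over an embedding space: repeatedly pick the point farthest from everything already selected. The sketch below is a generic illustration of that idea, not the paper's algorithm; the function name and seeding are assumptions:

```python
import numpy as np

def k_center_greedy(embeddings, k, seed=0):
    """Diversity-based selection sketch: greedily pick the point
    farthest from the already-selected set (k-center greedy)."""
    X = np.asarray(embeddings, dtype=float)
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]
    # Distance from every point to its nearest selected center
    d = np.linalg.norm(X - X[selected[0]], axis=1)
    while len(selected) < k:
        i = int(np.argmax(d))  # farthest point from the current set
        selected.append(i)
        d = np.minimum(d, np.linalg.norm(X - X[i], axis=1))
    return selected
```

On data with several well-separated groups, the greedy rule covers each group before revisiting any of them, which is what makes it a reasonable initial-labeling heuristic.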
arXiv Detail & Related papers (2022-07-25T16:11:55Z) - DC-BENCH: Dataset Condensation Benchmark [79.18718490863908]
This work provides the first large-scale standardized benchmark on dataset condensation.
It consists of a suite of evaluations to comprehensively reflect the generalizability and effectiveness of condensation methods.
The benchmark library is open-sourced to facilitate future research and application.
arXiv Detail & Related papers (2022-07-20T03:54:05Z) - Active Weighted Aging Ensemble for Drifted Data Stream Classification [2.277447144331876]
Concept drift destabilizes the performance of the classification model and seriously degrades its quality.
The proposed method has been evaluated through computer experiments using both real and generated data streams.
The results confirm the high quality of the proposed algorithm over state-of-the-art methods.
arXiv Detail & Related papers (2021-12-19T13:52:53Z) - Graph Contrastive Clustering [131.67881457114316]
We propose a novel graph contrastive learning framework, which is then applied to the clustering task, and we come up with the Graph Contrastive Clustering (GCC) method.
Specifically, on the one hand, the graph Laplacian based contrastive loss is proposed to learn more discriminative and clustering-friendly features.
On the other hand, a novel graph-based contrastive learning strategy is proposed to learn more compact clustering assignments.
arXiv Detail & Related papers (2021-04-03T15:32:49Z) - Event-Driven News Stream Clustering using Entity-Aware Contextual
Embeddings [14.225334321146779]
We propose a method for online news stream clustering that is a variant of the non-parametric streaming K-means algorithm.
Our model uses a combination of sparse and dense document representations and aggregates document-cluster similarity along these multiple representations.
We show that the use of a suitable fine-tuning objective and external knowledge in pre-trained transformer models yields significant improvements in the effectiveness of contextual embeddings.
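The non-parametric streaming K-means variant this summary builds on can be sketched in a few lines: each incoming document vector joins the most similar cluster centroid, or opens a new cluster when no centroid exceeds a similarity threshold. The threshold `tau` and the cosine-similarity formulation are assumptions for illustration, not details from the paper:

```python
import numpy as np

def stream_kmeans(docs, tau=0.5):
    """Sketch of non-parametric streaming K-means over document vectors:
    assign each document to the most similar centroid, or open a new
    cluster when the best cosine similarity falls below tau."""
    centroids, sums, assign = [], [], []
    for v in docs:
        v = np.asarray(v, dtype=float)
        v = v / np.linalg.norm(v)  # unit-normalize so dot product = cosine
        if centroids:
            sims = [float(v @ c) for c in centroids]
            j = int(np.argmax(sims))
            if sims[j] >= tau:
                # Update the running sum and re-normalize the centroid
                sums[j] += v
                centroids[j] = sums[j] / np.linalg.norm(sums[j])
                assign.append(j)
                continue
        # No sufficiently similar cluster: start a new one
        sums.append(v.copy())
        centroids.append(v.copy())
        assign.append(len(centroids) - 1)
    return assign
```

Because the number of clusters is not fixed in advance, new events in the stream naturally spawn new clusters, which is the property that makes this family of methods suitable for news streams.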
arXiv Detail & Related papers (2021-01-26T19:58:30Z) - GuCNet: A Guided Clustering-based Network for Improved Classification [15.747227188672088]
We present a novel yet very simple classification technique that leverages the ease of classifiability of any existing well-separable dataset for guidance.
Although the guide dataset may or may not have any semantic relationship with the experimental dataset, the proposed network tries to embed class-wise features of the challenging dataset into the distinct clusters of the guide set.
arXiv Detail & Related papers (2020-10-11T10:22:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.