Related papers: Event-Driven News Stream Clustering using Entity-Aware Contextual Embeddings

URL: http://arxiv.org/abs/2101.11059v1
Date: Tue, 26 Jan 2021 19:58:30 GMT
Title: Event-Driven News Stream Clustering using Entity-Aware Contextual Embeddings
Authors: Kailash Karthik Saravanakumar, Miguel Ballesteros, Muthu Kumar Chandrasekaran, Kathleen McKeown
Abstract summary: We propose a method for online news stream clustering that is a variant of the non-parametric streaming K-means algorithm. Our model uses a combination of sparse and dense document representations and aggregates document-cluster similarity along these multiple representations. We show that the use of a suitable fine-tuning objective and external knowledge in pre-trained transformer models yields significant improvements in the effectiveness of contextual embeddings.
Score: 14.225334321146779
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose a method for online news stream clustering that is a variant of the non-parametric streaming K-means algorithm. Our model uses a combination of sparse and dense document representations, aggregates document-cluster similarity along these multiple representations and makes the clustering decision using a neural classifier. The weighted document-cluster similarity model is learned using a novel adaptation of the triplet loss into a linear classification objective. We show that the use of a suitable fine-tuning objective and external knowledge in pre-trained transformer models yields significant improvements in the effectiveness of contextual embeddings for clustering. Our model achieves a new state-of-the-art on a standard stream clustering dataset of English documents.
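
The non-parametric streaming assignment step described in the abstract can be illustrated with a minimal sketch. This is not the authors' model: it assumes a single dense vector per document, plain cosine similarity, and a fixed threshold in place of the learned weighted similarity and neural classifier. The names `StreamClusterer`, `cosine`, and `threshold` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class StreamClusterer:
    """Assigns each incoming document to the most similar cluster, or opens a new one.

    Assumption: a fixed similarity threshold stands in for the paper's learned
    clustering decision; the number of clusters is not set in advance.
    """

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.centroids = []  # running mean vector of each cluster
        self.counts = []     # number of documents in each cluster

    def add(self, vec):
        # Find the existing cluster whose centroid is most similar to the document.
        best, best_sim = -1, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim

        # No cluster is similar enough: start a new cluster with this document.
        if best == -1 or best_sim < self.threshold:
            self.centroids.append(list(vec))
            self.counts.append(1)
            return len(self.centroids) - 1

        # Otherwise update the running mean of the chosen cluster.
        n = self.counts[best]
        self.centroids[best] = [
            (c * n + x) / (n + 1) for c, x in zip(self.centroids[best], vec)
        ]
        self.counts[best] = n + 1
        return best


clusterer = StreamClusterer(threshold=0.8)
assignments = [clusterer.add(v) for v in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])]
```

In this toy stream the first two vectors point in nearly the same direction and share a cluster, while the third is orthogonal to them and opens a new one.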

Related papers

An Enhanced Model-based Approach for Short Text Clustering [58.60681789677676]
Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook. Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches. We propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts. Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance.
arXiv Detail & Related papers (2025-07-18T10:07:42Z)
Self-supervised Latent Space Optimization with Nebula Variational Coding [87.20343320266215]
This paper proposes a variational inference model which leads to a clustered embedding. We introduce additional variables in the latent space, called nebula anchors, that guide the latent variables to form clusters during training. Since each latent feature can be labeled with the closest anchor, we also propose to apply metric learning in a self-supervised way to make the separation between clusters more explicit.
arXiv Detail & Related papers (2025-06-02T08:13:32Z)
An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets [0.0]
We present an improved clustering technique for large textual datasets by leveraging fine-tuned word embeddings. We show significant improvements in clustering metrics such as silhouette score, purity, and adjusted Rand index (ARI). The proposed technique will help to bridge the gap between semantic understanding and statistical robustness for large-scale text-mining tasks.
arXiv Detail & Related papers (2025-02-22T08:28:41Z)
k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering [0.0]
We introduce k-LLMmeans, a novel modification of the k-means clustering algorithm that utilizes LLMs to generate textual summaries as cluster centroids. This modification preserves the properties of k-means while offering greater interpretability. We present a case study showcasing the interpretability of evolving cluster centroids in sequential text streams.
arXiv Detail & Related papers (2025-02-12T19:50:22Z)
Self Supervised Correlation-based Permutations for Multi-View Clustering [7.972599673048582]
We propose an end-to-end deep learning-based MVC framework for general data. Our approach involves learning meaningful fused data representations with a novel permutation-based canonical correlation objective. We demonstrate the effectiveness of our model using ten MVC benchmark datasets.
arXiv Detail & Related papers (2024-02-26T08:08:30Z)
Unified Multi-View Orthonormal Non-Negative Graph Based Clustering Framework [74.25493157757943]
We formulate a novel clustering model, which exploits the non-negative feature property and incorporates the multi-view information into a unified joint learning framework. We also explore, for the first time, the multi-model non-negative graph-based approach to clustering data based on deep features.
arXiv Detail & Related papers (2022-11-03T08:18:27Z)
ClusterQ: Semantic Feature Distribution Alignment for Data-Free Quantization [111.12063632743013]
We propose a new and effective data-free quantization method termed ClusterQ. To obtain high inter-class separability of semantic features, we cluster and align the feature distribution statistics. We also incorporate the intra-class variance to solve class-wise mode collapse.
arXiv Detail & Related papers (2022-04-30T06:58:56Z)
Mixture Model Auto-Encoders: Deep Clustering through Dictionary Learning [72.9458277424712]
Mixture Model Auto-Encoders (MixMate) is a novel architecture that clusters data by performing inference on a generative model. We show that MixMate achieves competitive performance compared to state-of-the-art deep clustering algorithms.
arXiv Detail & Related papers (2021-10-10T02:30:31Z)
A Framework for Joint Unsupervised Learning of Cluster-Aware Embedding for Heterogeneous Networks [6.900303913555705]
Heterogeneous Information Network (HIN) embedding refers to the low-dimensional projections of the HIN nodes that preserve the HIN structure and semantics. We propose ours for joint learning of cluster embeddings as well as cluster-aware HIN embedding.
arXiv Detail & Related papers (2021-08-09T11:36:36Z)
Graph Contrastive Clustering [131.67881457114316]
We propose a novel graph contrastive learning framework and apply it to the clustering task, yielding the Graph Contrastive Clustering (GCC) method. Specifically, on the one hand, the graph Laplacian based contrastive loss is proposed to learn more discriminative and clustering-friendly features. On the other hand, a novel graph-based contrastive learning strategy is proposed to learn more compact clustering assignments.
arXiv Detail & Related papers (2021-04-03T15:32:49Z)
Meta-learning representations for clustering with infinite Gaussian mixture models [39.56814839510978]
We propose a meta-learning method that trains neural networks for obtaining representations such that clustering performance improves. The proposed method can cluster unseen unlabeled data using knowledge meta-learned with labeled data that are different from the unlabeled data.
arXiv Detail & Related papers (2021-03-01T02:05:31Z)
Joint Optimization of an Autoencoder for Clustering and Embedding [22.16059261437617]
We present an alternative where the autoencoder and the clustering are learned simultaneously. That simple neural network, referred to as the clustering module, can be integrated into a deep autoencoder resulting in a deep clustering model.
arXiv Detail & Related papers (2020-12-07T14:38:10Z)
Mixing Consistent Deep Clustering [3.5786621294068373]
Good latent representations produce semantically mixed outputs when decoding linear interpolations of two latent representations. We propose the Mixing Consistent Deep Clustering method which encourages representations to appear realistic. We show that the proposed method can be added to existing autoencoders to further improve clustering performance.
arXiv Detail & Related papers (2020-11-03T19:47:06Z)
Set Based Stochastic Subsampling [85.5331107565578]
We propose a set-based two-stage end-to-end neural subsampling model that is jointly optimized with an arbitrary downstream task network. We show that it outperforms the relevant baselines under low subsampling rates on a variety of tasks including image classification, image reconstruction, function reconstruction and few-shot classification.
arXiv Detail & Related papers (2020-06-25T07:36:47Z)
LSD-C: Linearly Separable Deep Clusters [145.89790963544314]
We present LSD-C, a novel method to identify clusters in an unlabeled dataset. Our method draws inspiration from recent semi-supervised learning practice and proposes to combine our clustering algorithm with self-supervised pretraining and strong data augmentation. We show that our approach significantly outperforms competitors on popular public image benchmarks including CIFAR 10/100, STL 10 and MNIST, as well as the document classification dataset Reuters 10K.
arXiv Detail & Related papers (2020-06-17T17:58:10Z)

This list is automatically generated from the titles and abstracts of the papers on this site.