Influence of various text embeddings on clustering performance in NLP
- URL: http://arxiv.org/abs/2305.03144v1
- Date: Thu, 4 May 2023 20:53:19 GMT
- Title: Influence of various text embeddings on clustering performance in NLP
- Authors: Rohan Saha
- Abstract summary: A clustering approach can be used to relabel the correct star ratings by grouping the text reviews into individual groups.
In this work, we explore the task of choosing different text embeddings to represent these reviews and also explore the impact the embedding choice has on the performance of various classes of clustering algorithms.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the advent of e-commerce platforms, reviews are crucial for customers to
assess the credibility of a product. The star ratings do not always match the
review text written by the customer. For example, a three star rating (out of
five) may be incongruous with the review text, which may be more suitable for a
five star review. A clustering approach can be used to relabel the correct star
ratings by grouping the text reviews into individual groups. In this work, we
explore the task of choosing different text embeddings to represent these
reviews and also explore the impact the embedding choice has on the performance
of various classes of clustering algorithms. We use contextual (BERT) and
non-contextual (Word2Vec) text embeddings to represent the text and measure
their impact on three classes of clustering algorithms: partitioning based
(KMeans), single linkage agglomerative hierarchical, and density based (DBSCAN
and HDBSCAN), each with various experimental settings. We use the silhouette
score, adjusted Rand index, and cluster purity metrics to evaluate
the performance of the algorithms and discuss the impact of different
embeddings on the clustering performance. Our results indicate that the type of
embedding chosen drastically affects the performance of the algorithm, the
performance varies greatly across different types of clustering algorithms, no
embedding type is better than the other, and DBSCAN outperforms KMeans and
single linkage agglomerative clustering but also labels more data points as
outliers. We provide a thorough comparison of the performances of different
algorithms and provide numerous ideas to foster further research in the domain
of text clustering.
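The evaluation pipeline described above — embed the reviews, cluster with a partitioning-based and a density-based algorithm, then score with silhouette, adjusted Rand index, and cluster purity — can be sketched as follows. This is a minimal illustration using synthetic vectors in place of the BERT/Word2Vec review embeddings; the `purity` helper and all parameters (dimension, `eps`, cluster count) are illustrative assumptions, not the paper's code:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, adjusted_rand_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for review embeddings: three star-rating groups in 50-d space.
true_labels = np.repeat([0, 1, 2], 40)
centers = rng.normal(scale=5.0, size=(3, 50))
X = centers[true_labels] + rng.normal(scale=0.5, size=(120, 50))

def purity(y_true, y_pred):
    """Cluster purity: fraction of points that belong to their cluster's majority class."""
    total = 0
    for c in np.unique(y_pred):
        members = y_true[y_pred == c]
        total += np.bincount(members).max()
    return total / len(y_true)

# Partitioning-based (KMeans) vs. density-based (DBSCAN) clustering.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=6.5, min_samples=5).fit(X)  # eps tuned to the synthetic scale

for name, labels in [("KMeans", km.labels_), ("DBSCAN", db.labels_)]:
    print(name,
          "silhouette=%.2f" % silhouette_score(X, labels),
          "ARI=%.2f" % adjusted_rand_score(true_labels, labels),
          "purity=%.2f" % purity(true_labels, labels))
```

Note that DBSCAN reports noise points with the label `-1`, which is why the paper observes it labelling more data points as outliers than KMeans or agglomerative clustering.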
Related papers
- Enhancing Affinity Propagation for Improved Public Sentiment Insights [0.0]
This project introduces an approach using unsupervised learning techniques to analyze sentiment.
AP clustering groups text data based on natural patterns, without needing predefined cluster numbers.
To enhance performance, AP is combined with Agglomerative Hierarchical Clustering.
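The two-stage idea summarized above — Affinity Propagation to discover fine-grained clusters without a preset cluster count, followed by agglomerative hierarchical clustering to merge them into coarser groups — can be sketched with scikit-learn. The synthetic points and the target of two coarse groups are illustrative assumptions, not the project's data or settings:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering

# Synthetic stand-ins for sentence embeddings: four fine-grained groups.
X, _ = make_blobs(n_samples=80, centers=4, cluster_std=0.5, random_state=0)

# Stage 1: Affinity Propagation finds exemplars without a preset cluster number.
ap = AffinityPropagation(random_state=0).fit(X)
n_fine = len(ap.cluster_centers_)

# Stage 2: merge the AP exemplars hierarchically into two coarse groups.
agg = AgglomerativeClustering(n_clusters=2).fit(ap.cluster_centers_)

# Map each point through its AP cluster to a coarse label.
coarse = agg.labels_[ap.labels_]
print(f"{n_fine} fine clusters merged into {len(set(coarse))} coarse groups")
```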
arXiv Detail & Related papers (2024-10-12T19:20:33Z)
- Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation [59.37587762543934]
This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS).
Existing methods suffer from a granularity inconsistency regarding the usage of group tokens.
We propose the prototypical guidance network (PGSeg) that incorporates multi-modal regularization.
arXiv Detail & Related papers (2023-10-29T13:18:00Z) - Reinforcement Graph Clustering with Unknown Cluster Number [91.4861135742095]
We propose a new deep graph clustering method termed Reinforcement Graph Clustering.
In our proposed method, cluster number determination and unsupervised representation learning are unified into a single framework.
In order to conduct feedback actions, the clustering-oriented reward function is proposed to enhance the cohesion of the same clusters and separate the different clusters.
arXiv Detail & Related papers (2023-08-13T18:12:28Z)
- CEIL: A General Classification-Enhanced Iterative Learning Framework for Text Clustering [16.08402937918212]
We propose a novel Classification-Enhanced Iterative Learning framework for short text clustering.
In each iteration, we first adopt a language model to retrieve the initial text representations.
After strict data filtering and aggregation processes, samples with clean category labels are retrieved, which serve as supervision information.
Finally, the updated language model with improved representation ability is used to enhance clustering in the next iteration.
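The iterative loop summarized above can be caricatured with classical tools: cluster, keep only the samples nearest their centroid as "clean" pseudo-labels, train a classifier on them, and re-predict. Here a logistic-regression classifier stands in for CEIL's updated language model; the data and every parameter are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for short-text representations.
X, y_true = make_blobs(n_samples=200, centers=3, cluster_std=0.8, random_state=0)

pseudo = None
for it in range(3):
    km = KMeans(n_clusters=3, n_init=10, random_state=it).fit(X)
    # Data filtering: keep the half of the samples closest to their centroid
    # as "clean" supervision, mimicking CEIL's filtering-and-aggregation step.
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    keep = dist < np.percentile(dist, 50)
    clf = LogisticRegression(max_iter=1000).fit(X[keep], km.labels_[keep])
    # The classifier (standing in for the updated language model) relabels
    # every sample for the next iteration.
    pseudo = clf.predict(X)
```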
arXiv Detail & Related papers (2023-04-20T14:04:31Z)
- A framework for benchmarking clustering algorithms [2.900810893770134]
Clustering algorithms can be tested on a variety of benchmark problems.
Many research papers and graduate theses consider only a small number of datasets.
We have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms.
arXiv Detail & Related papers (2022-09-20T06:10:41Z)
- Hybrid Multisource Feature Fusion for the Text Clustering [5.5586788751870175]
Text clustering is an unsupervised text mining method used to partition a large collection of text documents into groups.
We present a hybrid multisource feature fusion (HMFF) framework comprising three components, feature representation of multimodel, mutual similarity matrices and feature fusion.
Our HMFF framework outperforms other recently published algorithms on 7 of 11 public benchmark datasets and has leading performance on the remaining 4 benchmark datasets as well.
arXiv Detail & Related papers (2021-08-24T19:32:09Z)
- Comprehensive Studies for Arbitrary-shape Scene Text Detection [78.50639779134944]
We propose a unified framework for the bottom-up based scene text detection methods.
Under the unified framework, we ensure the consistent settings for non-core modules.
Comprehensive investigations and detailed analyses reveal the advantages and disadvantages of previous models.
arXiv Detail & Related papers (2021-07-25T13:18:55Z)
- The Three Ensemble Clustering (3EC) Algorithm for Pattern Discovery in Unsupervised Learning [1.0465883970481493]
The Three Ensemble Clustering (3EC) algorithm classifies unlabeled data into quality clusters as a part of unsupervised learning.
Each partitioned cluster is treated as a new data set and is a candidate for finding the optimal clustering algorithm.
Users can experiment with different sets of stopping criteria and choose the most sensible group of quality clusters.
arXiv Detail & Related papers (2021-07-08T10:15:18Z)
- Graph Contrastive Clustering [131.67881457114316]
We propose a novel graph contrastive learning framework and apply it to the clustering task, yielding the Graph Contrastive Clustering (GCC) method.
Specifically, on the one hand, the graph Laplacian based contrastive loss is proposed to learn more discriminative and clustering-friendly features.
On the other hand, a novel graph-based contrastive learning strategy is proposed to learn more compact clustering assignments.
arXiv Detail & Related papers (2021-04-03T15:32:49Z)
- Hierarchical Bi-Directional Self-Attention Networks for Paper Review Rating Recommendation [81.55533657694016]
We propose a Hierarchical bi-directional self-attention Network framework (HabNet) for paper review rating prediction and recommendation.
Specifically, we leverage the hierarchical structure of the paper reviews with three levels of encoders: sentence encoder (level one), intra-review encoder (level two), and inter-review encoder (level three).
We are able to identify useful predictors to make the final acceptance decision, as well as to help discover the inconsistency between numerical review ratings and text sentiment conveyed by reviewers.
arXiv Detail & Related papers (2020-11-02T08:07:50Z)
- Weakly-Supervised Aspect-Based Sentiment Analysis via Joint Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn <sentiment, aspect> joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.