SDEC: Semantic Deep Embedded Clustering
- URL: http://arxiv.org/abs/2508.15823v1
- Date: Mon, 18 Aug 2025 02:08:35 GMT
- Title: SDEC: Semantic Deep Embedded Clustering
- Authors: Mohammad Wali Ur Rahman, Ric Nevarez, Lamia Tasnim Mim, Salim Hariri, et al.
- Abstract summary: This work presents Semantic Deep Embedded Clustering (SDEC), an unsupervised text clustering framework. SDEC preserves semantic relationships during data reconstruction by combining Mean Squared Error (MSE) and Cosine Similarity Loss (CSL) within an autoencoder. The framework not only outperformed existing methods with a clustering accuracy of 85.7% on AG News and set a new benchmark of 53.63% on Yahoo! Answers, but also showed robust performance across other diverse text corpora.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The high dimensional and semantically complex nature of textual Big data presents significant challenges for text clustering, which frequently lead to suboptimal groupings when using conventional techniques like k-means or hierarchical clustering. This work presents Semantic Deep Embedded Clustering (SDEC), an unsupervised text clustering framework that combines an improved autoencoder with transformer-based embeddings to overcome these challenges. This novel method preserves semantic relationships during data reconstruction by combining Mean Squared Error (MSE) and Cosine Similarity Loss (CSL) within an autoencoder. Furthermore, a semantic refinement stage that takes advantage of the contextual richness of transformer embeddings is used by SDEC to further improve a clustering layer with soft cluster assignments and distributional loss. The capabilities of SDEC are demonstrated by extensive testing on five benchmark datasets: AG News, Yahoo! Answers, DBPedia, Reuters 2, and Reuters 5. The framework not only outperformed existing methods with a clustering accuracy of 85.7% on AG News and set a new benchmark of 53.63% on Yahoo! Answers, but also showed robust performance across other diverse text corpora. These findings highlight the significant improvements in accuracy and semantic comprehension of text data provided by SDEC's advances in unsupervised text clustering.
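To make the two training signals in the abstract concrete, here is a minimal PyTorch sketch: a reconstruction loss blending MSE with cosine similarity, and a DEC-style clustering layer with Student's t soft assignments refined against a sharpened target distribution. The layer sizes, blend weight `w`, and cluster count are illustrative assumptions, not values from the paper.
```python
# Minimal sketch of SDEC-style training; sizes and weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SDECSketch(nn.Module):
    def __init__(self, dim=768, latent=32, n_clusters=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, dim))
        # Learnable cluster centroids for the DEC-style clustering layer.
        self.centroids = nn.Parameter(torch.randn(n_clusters, latent))

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

    def soft_assign(self, z, alpha=1.0):
        # Student's t-kernel soft assignments q_ij, as in DEC.
        d2 = torch.cdist(z, self.centroids).pow(2)
        q = (1.0 + d2 / alpha).pow(-(alpha + 1) / 2)
        return q / q.sum(dim=1, keepdim=True)

def reconstruction_loss(x, x_hat, w=0.5):
    # Combined MSE + Cosine Similarity Loss; the weight w is an assumption.
    mse = F.mse_loss(x_hat, x)
    csl = (1.0 - F.cosine_similarity(x_hat, x, dim=1)).mean()
    return w * mse + (1 - w) * csl

def target_distribution(q):
    # Sharpened target p_ij used by the distributional (KL) loss.
    p = q.pow(2) / q.sum(dim=0)
    return p / p.sum(dim=1, keepdim=True)

# Usage on a batch of transformer embeddings:
model = SDECSketch()
x = torch.randn(16, 768)  # e.g., sentence-transformer outputs
z, x_hat = model(x)
q = model.soft_assign(z)
loss = reconstruction_loss(x, x_hat) + F.kl_div(
    q.log(), target_distribution(q).detach(), reduction="batchmean")
loss.backward()
```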
Related papers
- Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models
ARISE (Attention-weighted Representation with Integrated Semantic Embeddings) is presented. It builds semantic-aware representations that complement the metric space of categorical data for accurate clustering. Experiments on eight benchmark datasets demonstrate consistent improvements over seven representative counterparts.
arXiv Detail & Related papers (2026-01-03T11:37:46Z)
- Enhancing Retrieval-Augmented Generation with Topic-Enriched Embeddings: A Hybrid Approach Integrating Traditional NLP Techniques
This work proposes topic-enriched embeddings that integrate term-based signals and topic structure with contextual sentence embeddings. By jointly capturing term-level and topic-level semantics, topic-enriched embeddings improve semantic clustering, increase retrieval precision, and reduce computational burden.
arXiv Detail & Related papers (2025-12-31T13:43:57Z)
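As a rough illustration of this hybrid idea (not the paper's exact pipeline), the sketch below concatenates an LDA topic distribution with a dense sentence embedding; the model names and dimensions are assumptions.
```python
# Hypothetical sketch: fuse topic-level and sentence-level semantics by
# concatenating an LDA topic distribution with a dense sentence embedding.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sentence_transformers import SentenceTransformer  # assumed embedding backend

docs = ["stocks fell on inflation fears", "the team won the championship"]

# Topic structure from term counts (the traditional NLP side).
counts = CountVectorizer().fit_transform(docs)
topics = LatentDirichletAllocation(n_components=5, random_state=0).fit_transform(counts)

# Contextual sentence embeddings (the neural side); model choice is illustrative.
dense = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# Topic-enriched embedding: L2-normalize each view, then concatenate.
def l2norm(m):
    return m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-12)

enriched = np.hstack([l2norm(dense), l2norm(topics)])
print(enriched.shape)  # (2, 384 + 5)
```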
- An Enhanced Model-based Approach for Short Text Clustering
Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook. Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches. We propose a collapsed Gibbs sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts. Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance.
arXiv Detail & Related papers (2025-07-18T10:07:42Z)
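For readers unfamiliar with GSDMM, here is a simplified collapsed Gibbs sampler for the Dirichlet Multinomial Mixture; it treats each token's count term independently, which is a simplification of the full conditional in the literature, and all hyperparameters are illustrative.
```python
# Hypothetical, simplified GSDMM-style collapsed Gibbs sampler.
import math, random
from collections import Counter

def gsdmm(docs, K=8, alpha=0.1, beta=0.1, iters=30, seed=0):
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size
    z = [rng.randrange(K) for _ in docs]           # cluster of each doc
    m = Counter(z)                                 # docs per cluster
    n = Counter()                                  # tokens per cluster
    nw = Counter()                                 # (cluster, word) counts
    for d, k in zip(docs, z):
        n[k] += len(d)
        for w in d:
            nw[k, w] += 1
    for _ in range(iters):
        for i, d in enumerate(docs):
            k = z[i]                               # remove doc i from its cluster
            m[k] -= 1; n[k] -= len(d)
            for w in d:
                nw[k, w] -= 1
            logp = []
            for c in range(K):                     # score every cluster
                lp = math.log(m[c] + alpha)
                for j, w in enumerate(d):
                    lp += math.log(nw[c, w] + beta) - math.log(n[c] + V * beta + j)
                logp.append(lp)
            mx = max(logp)
            p = [math.exp(l - mx) for l in logp]
            k = rng.choices(range(K), weights=p)[0]
            z[i] = k; m[k] += 1; n[k] += len(d)    # reinsert doc i
            for w in d:
                nw[k, w] += 1
    return z

labels = gsdmm([["cheap", "flights"], ["football", "score"], ["flight", "deals"]])
```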
- Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking
This paper proposes a novel framework that enhances RAG by integrating hierarchical text segmentation and clustering. During inference, the framework retrieves information by leveraging both segment-level and cluster-level vector representations. Evaluations on the NarrativeQA, QuALITY, and QASPER datasets indicate that the proposed method achieved improved results compared to traditional chunking techniques.
arXiv Detail & Related papers (2025-07-14T05:21:58Z)
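A toy version of two-level retrieval in the spirit of this entry (not the paper's implementation): score cluster centroids first, then rank segments within the selected clusters. The embeddings below are random stand-ins.
```python
# Hypothetical two-level (cluster, then segment) retrieval sketch.
import numpy as np

def cosine(a, b):
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-12)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-12)
    return a @ b.T

def retrieve(query, seg_vecs, seg_cluster, n_clusters=2, top_k=3):
    labels = np.unique(seg_cluster)
    # Cluster-level representation: mean of member segment vectors.
    centroids = np.stack([seg_vecs[seg_cluster == c].mean(axis=0) for c in labels])
    best = labels[np.argsort(-cosine(query[None], centroids)[0])[:n_clusters]]
    # Segment-level ranking restricted to the chosen clusters.
    idx = np.flatnonzero(np.isin(seg_cluster, best))
    scores = cosine(query[None], seg_vecs[idx])[0]
    return idx[np.argsort(-scores)][:top_k]

rng = np.random.default_rng(0)
segs = rng.normal(size=(10, 64))            # segment embeddings (stand-ins)
cluster_of = rng.integers(0, 3, size=10)    # cluster id per segment
print(retrieve(rng.normal(size=64), segs, cluster_of))
```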
- A Dynamic Framework for Semantic Grouping of Common Data Elements (CDE) Using Embeddings and Clustering
This research aims to develop a dynamic and scalable framework to facilitate harmonization of Common Data Elements (CDEs) across heterogeneous biomedical datasets. Our methodology leverages Large Language Models (LLMs) for context-aware text embeddings that convert CDEs into dense vectors capturing semantic relationships and patterns.
arXiv Detail & Related papers (2025-06-02T18:43:37Z)
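A minimal sketch of the embed-then-group step, assuming a sentence-transformer backend; the model name and distance threshold are illustrative, not the paper's choices.
```python
# Hypothetical CDE grouping: embed element descriptions, cluster by cosine
# distance (scikit-learn >= 1.2 for the `metric` keyword).
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sentence_transformers import SentenceTransformer

cdes = ["patient age in years", "age at enrollment (years)",
        "systolic blood pressure (mmHg)", "resting systolic BP"]

emb = SentenceTransformer("all-MiniLM-L6-v2").encode(cdes, normalize_embeddings=True)

# With unit-norm vectors, cosine distance = 1 - dot product.
dist = np.clip(1.0 - emb @ emb.T, 0.0, None)
groups = AgglomerativeClustering(
    n_clusters=None, metric="precomputed",
    linkage="average", distance_threshold=0.4,   # threshold is illustrative
).fit_predict(dist)
for cde, g in zip(cdes, groups):
    print(g, cde)
```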
- An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets
We present an improved clustering technique for large textual datasets by leveraging fine-tuned word embeddings. We show significant improvements in clustering metrics such as silhouette score, purity, and adjusted Rand index (ARI). The proposed technique will help to bridge the gap between semantic understanding and statistical robustness for large-scale text-mining tasks.
arXiv Detail & Related papers (2025-02-22T08:28:41Z)
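The three metrics named in this entry are easy to compute; silhouette and ARI come from scikit-learn, while purity needs a small helper (the data below is a random stand-in).
```python
# Sketch of the three evaluation metrics named above.
import numpy as np
from sklearn.metrics import silhouette_score, adjusted_rand_score

def purity(y_true, y_pred):
    # Fraction of points whose cluster's majority true label matches their own.
    total = 0
    for c in np.unique(y_pred):
        _, counts = np.unique(y_true[y_pred == c], return_counts=True)
        total += counts.max()
    return total / len(y_true)

X = np.random.default_rng(0).normal(size=(100, 16))   # stand-in embeddings
y_true = np.repeat([0, 1], 50)
y_pred = (X[:, 0] > 0).astype(int)                    # toy "clustering"

print(silhouette_score(X, y_pred))        # geometry only, no labels needed
print(purity(y_true, y_pred))             # needs ground-truth labels
print(adjusted_rand_score(y_true, y_pred))
```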
- Autoencoded UMAP-Enhanced Clustering for Unsupervised Learning
We propose a novel approach to unsupervised learning by constructing a non-linear embedding of the data into a low-dimensional space, followed by any conventional clustering algorithm. The embedding promotes clusterability of the data and is composed of two mappings: the encoder of an autoencoder neural network and the output of the UMAP algorithm. When applied to MNIST data, AUEC significantly outperforms state-of-the-art techniques in terms of clustering accuracy.
arXiv Detail & Related papers (2025-01-13T22:30:38Z)
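A compressed pipeline in the spirit of this entry; PCA stands in for the trained autoencoder encoder to keep the sketch short, whereas the actual method learns that mapping.
```python
# Hypothetical compress -> UMAP -> K-means pipeline (pip install umap-learn).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import umap

X = np.random.default_rng(0).normal(size=(500, 784))   # stand-in for MNIST
compressed = PCA(n_components=32).fit_transform(X)      # encoder surrogate
embedded = umap.UMAP(n_components=2, random_state=0).fit_transform(compressed)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(embedded)
```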
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
- Clustering Algorithms and RAG Enhancing Semi-Supervised Text Classification with Large LLMs
This paper proposes a Clustering, Labeling, then Augmenting framework that enhances performance on semi-supervised text classification (SSTC) tasks. Unlike traditional SSTC approaches, this framework employs clustering to select representative "landmarks" for labeling. Empirical results show that even in complex text document classification scenarios involving over 100 categories, our method achieves state-of-the-art accuracies of 95.41% on the Reuters dataset and 82.43% on the Web of Science dataset.
arXiv Detail & Related papers (2024-11-09T13:17:39Z)
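A minimal sketch of the "cluster, then label landmarks" step: cluster unlabeled documents and surface the item nearest each centroid for a human (or LLM) to label. The embeddings here are random stand-ins.
```python
# Hypothetical landmark selection via K-means centroids.
import numpy as np
from sklearn.cluster import KMeans

emb = np.random.default_rng(0).normal(size=(200, 64))  # document embeddings
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(emb)

# Landmark = the document closest to each cluster centroid.
landmarks = [
    int(np.argmin(np.linalg.norm(emb - c, axis=1))) for c in km.cluster_centers_
]
print(landmarks)  # indices of documents to send for labeling
```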
- Self-Supervised Graph Embedding Clustering
The K-means one-step dimensionality-reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks. We propose a unified framework that integrates manifold learning with K-means, resulting in a self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
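As a rough illustration of the general graph-embedding-plus-K-means idea (not this paper's unified formulation), one can take a spectral embedding of a neighborhood graph and cluster it.
```python
# Hypothetical spectral embedding -> K-means sketch.
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 50))
Z = SpectralEmbedding(n_components=8, n_neighbors=10).fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)
```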
- Federated Contrastive Learning for Personalized Semantic Communication
We design a federated contrastive learning (FedCL) framework aimed at supporting personalized semantic communication. FedCL enables collaborative training of local semantic encoders across multiple clients and a global semantic decoder owned by the base station. To tackle the semantic imbalance issue arising from heterogeneous datasets across distributed clients, we employ contrastive learning to train a semantic centroid generator.
arXiv Detail & Related papers (2024-06-13T14:45:35Z)
- Text Clustering with Large Language Model Embeddings
The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms. Recent advancements in large language models (LLMs) have the potential to enhance this task. Findings indicate that LLM embeddings are superior at capturing subtleties in structured language.
arXiv Detail & Related papers (2024-03-22T11:08:48Z)
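The embed-then-cluster recipe this entry evaluates reduces to a few lines; the model name and cluster count below are placeholders, not the paper's setup.
```python
# Hypothetical LLM-embedding clustering sketch.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

texts = ["rate hike cools markets", "striker signs new contract",
         "bond yields climb", "cup final goes to penalties"]
emb = SentenceTransformer("all-MiniLM-L6-v2").encode(texts, normalize_embeddings=True)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
print(labels)  # finance vs. sport should separate
```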
- Compositional Generalization for Data-to-Text Generation
We propose a novel model that addresses compositional generalization by clustering predicates into groups. Our model generates text in a sentence-by-sentence manner, relying on one cluster of predicates at a time. It significantly outperforms T5 baselines across all evaluation metrics.
arXiv Detail & Related papers (2023-12-05T13:23:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.