A Process for Topic Modelling Via Word Embeddings
- URL: http://arxiv.org/abs/2312.03705v1
- Date: Fri, 6 Oct 2023 15:10:35 GMT
- Title: A Process for Topic Modelling Via Word Embeddings
- Authors: Diego Saldaña Ulloa
- Abstract summary: This work combines algorithms based on word embeddings, dimensionality reduction, and clustering.
The objective is to obtain topics from a set of unclassified texts.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work combines algorithms based on word embeddings, dimensionality
reduction, and clustering. The objective is to obtain topics from a set of
unclassified texts. The word embeddings are obtained with the BERT model, a
neural network architecture widely used in NLP tasks. Because these embeddings
are high-dimensional, a dimensionality reduction technique called UMAP is
applied; it reduces the number of dimensions while preserving part of the
local and global structure of the original data. K-Means is used as the
clustering algorithm to obtain the topics. The topics are then evaluated using
TF-IDF statistics, Topic Diversity, and Topic Coherence to assess the meaning
of the words in each cluster. The results show good values, so this
topic-modelling process is a viable option for classifying or clustering
unlabelled texts.
Related papers
- Empowering Interdisciplinary Research with BERT-Based Models: An Approach Through SciBERT-CNN with Topic Modeling [0.0]
This paper introduces a novel approach using the SciBERT model and CNNs to systematically categorize academic abstracts.
The CNN uses convolution and pooling to enhance feature extraction and reduce dimensionality.
arXiv Detail & Related papers (2024-04-16T05:21:47Z)
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation [59.37587762543934]
This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS).
Existing methods suffer from a granularity inconsistency regarding the usage of group tokens.
We propose the prototypical guidance network (PGSeg) that incorporates multi-modal regularization.
arXiv Detail & Related papers (2023-10-29T13:18:00Z)
- Adaptively Clustering Neighbor Elements for Image-Text Generation [78.82346492527425]
We propose a novel Transformer-based image-to-text generation model termed ACF.
ACF adaptively clusters vision patches into object regions and language words into phrases to implicitly learn object-phrase alignments.
Experiment results demonstrate the effectiveness of ACF, which outperforms most SOTA captioning and VQA models.
arXiv Detail & Related papers (2023-01-05T08:37:36Z)
- Word Embeddings and Validity Indexes in Fuzzy Clustering [5.063728016437489]
This paper presents a fuzzy-based analysis of various vector representations of words, i.e., word embeddings.
We apply two popular fuzzy clustering algorithms to count-based word embeddings, with different methods and dimensionalities.
We evaluate the experimental results with various clustering validity indexes to compare the accuracy of different algorithm variants with different embeddings.
arXiv Detail & Related papers (2022-04-26T18:08:19Z)
- A Proposition-Level Clustering Approach for Multi-Document Summarization [82.4616498914049]
We revisit the clustering approach, grouping together propositions for more precise information alignment.
Our method detects salient propositions, clusters them into paraphrastic clusters, and generates a representative sentence for each cluster by fusing its propositions.
Our summarization method improves over the previous state-of-the-art MDS method on the DUC 2004 and TAC 2011 datasets.
arXiv Detail & Related papers (2021-12-16T10:34:22Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as or better than traditional approaches to problems arising in short texts.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- Amharic Text Clustering Using Encyclopedic Knowledge with Neural Word Embedding [0.0]
We propose a system that clusters Amharic text documents using Encyclopedic Knowledge (EK) with neural word embedding.
Test results show that the use of EK with word embedding for document clustering improves the average accuracy over the use of only EK.
arXiv Detail & Related papers (2021-03-31T05:37:33Z)
- Event-Driven News Stream Clustering using Entity-Aware Contextual Embeddings [14.225334321146779]
We propose a method for online news stream clustering that is a variant of the non-parametric streaming K-means algorithm.
Our model uses a combination of sparse and dense document representations and aggregates document-cluster similarity across these multiple representations.
We show that the use of a suitable fine-tuning objective and external knowledge in pre-trained transformer models yields significant improvements in the effectiveness of contextual embeddings.
arXiv Detail & Related papers (2021-01-26T19:58:30Z)
- Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel approach, called the hyperplane-based approach, for the automatic extraction of domain-specific words.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and that it outperforms mutual information.
arXiv Detail & Related papers (2020-11-18T17:42:32Z)
- Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too! [5.819224524813161]
We propose an alternative way to obtain topics: clustering pre-trained word embeddings while incorporating document information for weighted clustering and reranking top words (a rough sketch of this idea follows the list).
The best-performing combination of our approach performs as well as classical topic models, but with lower runtime and computational complexity.
arXiv Detail & Related papers (2020-04-30T16:18:18Z)
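The last entry above is close in spirit to the process described in the abstract: treat clusters of pre-trained word embeddings themselves as topics. Below is a rough sketch of that general idea, assuming gensim's downloadable GloVe vectors; ranking a cluster's words by proximity to its centroid is a simplification, not the paper's weighted clustering and reranking scheme.

```python
# Rough sketch: topics as K-Means clusters of pre-trained word embeddings.
# The centroid-proximity ranking below is a simplification, not the paper's
# weighted clustering and reranking method.
import numpy as np
import gensim.downloader as api
from sklearn.cluster import KMeans

def embedding_cluster_topics(vocabulary, n_topics=10, top_n=10):
    kv = api.load("glove-wiki-gigaword-100")    # pre-trained word vectors
    words = [w for w in vocabulary if w in kv]  # keep in-vocabulary words only
    X = np.stack([kv[w] for w in words])

    km = KMeans(n_clusters=n_topics, n_init=10, random_state=42).fit(X)
    topics = []
    for k in range(n_topics):
        # Rank the cluster's words by distance to its centroid; the nearest
        # words serve as the topic's descriptors.
        idx = np.where(km.labels_ == k)[0]
        dist = np.linalg.norm(X[idx] - km.cluster_centers_[k], axis=1)
        topics.append([words[i] for i in idx[np.argsort(dist)][:top_n]])
    return topics
```

A usage sketch: pass the corpus vocabulary (for example, the TF-IDF vectorizer's feature names from the pipeline above) and compare the resulting word lists with the document-clustering topics.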