Topic Extraction of Crawled Documents Collection using Correlated Topic
Model in MapReduce Framework
- URL: http://arxiv.org/abs/2001.01669v1
- Date: Mon, 6 Jan 2020 17:09:21 GMT
- Title: Topic Extraction of Crawled Documents Collection using Correlated Topic
Model in MapReduce Framework
- Authors: Mi Khine Oo and May Aye Khine
- Abstract summary: Correlated Topic Model with variational Expectation-Maximization algorithm is implemented in MapReduce framework.
The proposed approach utilizes the dataset crawled from the public digital library.
From the evaluation, the proposed approach has a comparable performance in terms of topic coherences with LDA implemented in MapReduce framework.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The tremendous increase in the amount of available research documents impels
researchers to propose topic models to extract the latent semantic themes of a
documents collection. However, how to extract the hidden topics of the
documents collection has become a crucial task for many topic model
applications. Moreover, conventional topic modeling approaches suffer from the
scalability problem when the size of documents collection increases. In this
paper, the Correlated Topic Model with variational Expectation-Maximization
algorithm is implemented in MapReduce framework to solve the scalability
problem. The proposed approach utilizes the dataset crawled from the public
digital library. In addition, the full-texts of the crawled documents are
analysed to enhance the accuracy of MapReduce CTM. The experiments are
conducted to demonstrate the performance of the proposed algorithm. From the
evaluation, the proposed approach has a comparable performance in terms of
topic coherences with LDA implemented in MapReduce framework.
Related papers
- How Does Generative Retrieval Scale to Millions of Passages? [68.98628807288972]
We conduct the first empirical study of generative retrieval techniques across various corpus scales.
We scale generative retrieval to millions of passages with a corpus of 8.8M passages and evaluating model sizes up to 11B parameters.
While generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge.
arXiv Detail & Related papers (2023-05-19T17:33:38Z) - Improving Contextualized Topic Models with Negative Sampling [3.708656266586146]
We propose a negative sampling mechanism for a contextualized topic model to improve the quality of the generated topics.
In particular, during model training, we perturb the generated document-topic vector and use a triplet loss to encourage the document reconstructed from the correct document-topic vector to be similar to the input document.
arXiv Detail & Related papers (2023-03-27T07:28:46Z) - Knowledge-Aware Bayesian Deep Topic Model [50.58975785318575]
We propose a Bayesian generative model for incorporating prior domain knowledge into hierarchical topic modeling.
Our proposed model efficiently integrates the prior knowledge and improves both hierarchical topic discovery and document representation.
arXiv Detail & Related papers (2022-09-20T09:16:05Z) - Augmenting Document Representations for Dense Retrieval with
Interpolation and Perturbation [49.940525611640346]
Document Augmentation for dense Retrieval (DAR) framework augments the representations of documents with their Dense Augmentation and perturbations.
We validate the performance of DAR on retrieval tasks with two benchmark datasets, showing that the proposed DAR significantly outperforms relevant baselines on the dense retrieval of both the labeled and unlabeled documents.
arXiv Detail & Related papers (2022-03-15T09:07:38Z) - Community-Detection via Hashtag-Graphs for Semi-Supervised NMF Topic
Models [0.0]
This paper outlines a novel approach on how to integrate topic structures of hashtag graphs into the estimation of topic models.
By applying this approach on recently streamed Twitter data it will be seen that this procedure actually leads to more intuitive and humanly interpretable topics.
arXiv Detail & Related papers (2021-11-17T12:52:16Z) - One-shot Key Information Extraction from Document with Deep Partial
Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task need to feed a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z) - Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as -- or better -- than traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z) - Improving Document Representations by Generating Pseudo Query Embeddings
for Dense Retrieval [11.465218502487959]
We design a method to mimic the queries on each of the documents by an iterative clustering process.
We also optimize the matching function with a two-step score calculation procedure.
Experimental results on several popular ranking and QA datasets show that our model can achieve state-of-the-art results.
arXiv Detail & Related papers (2021-05-08T05:28:24Z) - Efficient Clustering from Distributions over Topics [0.0]
We present an approach that relies on the results of a topic modeling algorithm over documents in a collection as a means to identify smaller subsets of documents where the similarity function can be computed.
This approach has proved to obtain promising results when identifying similar documents in the domain of scientific publications.
arXiv Detail & Related papers (2020-12-15T10:52:19Z) - Leveraging Graph to Improve Abstractive Multi-Document Summarization [50.62418656177642]
We develop a neural abstractive multi-document summarization (MDS) model which can leverage well-known graph representations of documents.
Our model utilizes graphs to encode documents in order to capture cross-document relations, which is crucial to summarizing long documents.
Our model can also take advantage of graphs to guide the summary generation process, which is beneficial for generating coherent and concise summaries.
arXiv Detail & Related papers (2020-05-20T13:39:47Z) - Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for
Fast and Good Topics too! [5.819224524813161]
We propose an alternative way to obtain topics: clustering pre-trained word embeddings while incorporating document information for weighted clustering and reranking top words.
The best performing combination for our approach performs as well as classical topic models, but with lower runtime and computational complexity.
arXiv Detail & Related papers (2020-04-30T16:18:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.